Ignore Warnings¶

In [2]:
import warnings
warnings.filterwarnings("ignore")

Importing required libraries¶

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max.rows',130)
pd.set_option('display.max.columns',130)
pd.set_option('float_format', '{:.2f}'.format)

Importing the dataset¶

In [4]:
df = pd.read_csv("application_data.csv")
In [4]:
# Checking few records from the dataframe
df.head()
Out[4]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100002 1 Cash loans M N Y 0 202500.00 406597.50 24700.50 351000.00 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.02 -9461 -637 -3648.00 -2120 NaN 1 1 0 1 1 0 Laborers 1.00 2 2 WEDNESDAY 10 0 0 0 0 0 0 Business Entity Type 3 0.08 0.26 0.14 0.02 0.04 0.97 0.62 0.01 0.00 0.07 0.08 0.12 0.04 0.02 0.02 0.00 0.00 0.03 0.04 0.97 0.63 0.01 0.00 0.07 0.08 0.12 0.04 0.02 0.02 0.00 0.00 0.03 0.04 0.97 0.62 0.01 0.00 0.07 0.08 0.12 0.04 0.02 0.02 0.00 0.00 reg oper account block of flats 0.01 Stone, brick No 2.00 2.00 2.00 2.00 -1134.00 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 1.00
1 100003 0 Cash loans F N N 0 270000.00 1293502.50 35698.50 1129500.00 Family State servant Higher education Married House / apartment 0.00 -16765 -1188 -1186.00 -291 NaN 1 1 0 1 1 0 Core staff 2.00 1 1 MONDAY 11 0 0 0 0 0 0 School 0.31 0.62 NaN 0.10 0.05 0.99 0.80 0.06 0.08 0.03 0.29 0.33 0.01 0.08 0.05 0.00 0.01 0.09 0.05 0.99 0.80 0.05 0.08 0.03 0.29 0.33 0.01 0.08 0.06 0.00 0.00 0.10 0.05 0.99 0.80 0.06 0.08 0.03 0.29 0.33 0.01 0.08 0.06 0.00 0.01 reg oper account block of flats 0.07 Block No 1.00 0.00 1.00 0.00 -828.00 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00
2 100004 0 Revolving loans M Y Y 0 67500.00 135000.00 6750.00 135000.00 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.01 -19046 -225 -4260.00 -2531 26.00 1 1 1 1 1 0 Laborers 1.00 2 2 MONDAY 9 0 0 0 0 0 0 Government NaN 0.56 0.73 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.00 -815.00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00
3 100006 0 Cash loans F N Y 0 135000.00 312682.50 29686.50 297000.00 Unaccompanied Working Secondary / secondary special Civil marriage House / apartment 0.01 -19005 -3039 -9833.00 -2437 NaN 1 1 0 1 0 0 Laborers 2.00 2 2 WEDNESDAY 17 0 0 0 0 0 0 Business Entity Type 3 NaN 0.65 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.00 0.00 2.00 0.00 -617.00 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.00 513000.00 21865.50 513000.00 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.03 -19932 -3038 -4311.00 -3458 NaN 1 1 0 1 0 0 Core staff 1.00 2 2 THURSDAY 11 0 0 0 0 1 1 Religion NaN 0.32 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.00 -1106.00 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00

Check structure of the data¶

In [5]:
df.info(verbose = True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 122 columns):
 #    Column                        Dtype  
---   ------                        -----  
 0    SK_ID_CURR                    int64  
 1    TARGET                        int64  
 2    NAME_CONTRACT_TYPE            object 
 3    CODE_GENDER                   object 
 4    FLAG_OWN_CAR                  object 
 5    FLAG_OWN_REALTY               object 
 6    CNT_CHILDREN                  int64  
 7    AMT_INCOME_TOTAL              float64
 8    AMT_CREDIT                    float64
 9    AMT_ANNUITY                   float64
 10   AMT_GOODS_PRICE               float64
 11   NAME_TYPE_SUITE               object 
 12   NAME_INCOME_TYPE              object 
 13   NAME_EDUCATION_TYPE           object 
 14   NAME_FAMILY_STATUS            object 
 15   NAME_HOUSING_TYPE             object 
 16   REGION_POPULATION_RELATIVE    float64
 17   DAYS_BIRTH                    int64  
 18   DAYS_EMPLOYED                 int64  
 19   DAYS_REGISTRATION             float64
 20   DAYS_ID_PUBLISH               int64  
 21   OWN_CAR_AGE                   float64
 22   FLAG_MOBIL                    int64  
 23   FLAG_EMP_PHONE                int64  
 24   FLAG_WORK_PHONE               int64  
 25   FLAG_CONT_MOBILE              int64  
 26   FLAG_PHONE                    int64  
 27   FLAG_EMAIL                    int64  
 28   OCCUPATION_TYPE               object 
 29   CNT_FAM_MEMBERS               float64
 30   REGION_RATING_CLIENT          int64  
 31   REGION_RATING_CLIENT_W_CITY   int64  
 32   WEEKDAY_APPR_PROCESS_START    object 
 33   HOUR_APPR_PROCESS_START       int64  
 34   REG_REGION_NOT_LIVE_REGION    int64  
 35   REG_REGION_NOT_WORK_REGION    int64  
 36   LIVE_REGION_NOT_WORK_REGION   int64  
 37   REG_CITY_NOT_LIVE_CITY        int64  
 38   REG_CITY_NOT_WORK_CITY        int64  
 39   LIVE_CITY_NOT_WORK_CITY       int64  
 40   ORGANIZATION_TYPE             object 
 41   EXT_SOURCE_1                  float64
 42   EXT_SOURCE_2                  float64
 43   EXT_SOURCE_3                  float64
 44   APARTMENTS_AVG                float64
 45   BASEMENTAREA_AVG              float64
 46   YEARS_BEGINEXPLUATATION_AVG   float64
 47   YEARS_BUILD_AVG               float64
 48   COMMONAREA_AVG                float64
 49   ELEVATORS_AVG                 float64
 50   ENTRANCES_AVG                 float64
 51   FLOORSMAX_AVG                 float64
 52   FLOORSMIN_AVG                 float64
 53   LANDAREA_AVG                  float64
 54   LIVINGAPARTMENTS_AVG          float64
 55   LIVINGAREA_AVG                float64
 56   NONLIVINGAPARTMENTS_AVG       float64
 57   NONLIVINGAREA_AVG             float64
 58   APARTMENTS_MODE               float64
 59   BASEMENTAREA_MODE             float64
 60   YEARS_BEGINEXPLUATATION_MODE  float64
 61   YEARS_BUILD_MODE              float64
 62   COMMONAREA_MODE               float64
 63   ELEVATORS_MODE                float64
 64   ENTRANCES_MODE                float64
 65   FLOORSMAX_MODE                float64
 66   FLOORSMIN_MODE                float64
 67   LANDAREA_MODE                 float64
 68   LIVINGAPARTMENTS_MODE         float64
 69   LIVINGAREA_MODE               float64
 70   NONLIVINGAPARTMENTS_MODE      float64
 71   NONLIVINGAREA_MODE            float64
 72   APARTMENTS_MEDI               float64
 73   BASEMENTAREA_MEDI             float64
 74   YEARS_BEGINEXPLUATATION_MEDI  float64
 75   YEARS_BUILD_MEDI              float64
 76   COMMONAREA_MEDI               float64
 77   ELEVATORS_MEDI                float64
 78   ENTRANCES_MEDI                float64
 79   FLOORSMAX_MEDI                float64
 80   FLOORSMIN_MEDI                float64
 81   LANDAREA_MEDI                 float64
 82   LIVINGAPARTMENTS_MEDI         float64
 83   LIVINGAREA_MEDI               float64
 84   NONLIVINGAPARTMENTS_MEDI      float64
 85   NONLIVINGAREA_MEDI            float64
 86   FONDKAPREMONT_MODE            object 
 87   HOUSETYPE_MODE                object 
 88   TOTALAREA_MODE                float64
 89   WALLSMATERIAL_MODE            object 
 90   EMERGENCYSTATE_MODE           object 
 91   OBS_30_CNT_SOCIAL_CIRCLE      float64
 92   DEF_30_CNT_SOCIAL_CIRCLE      float64
 93   OBS_60_CNT_SOCIAL_CIRCLE      float64
 94   DEF_60_CNT_SOCIAL_CIRCLE      float64
 95   DAYS_LAST_PHONE_CHANGE        float64
 96   FLAG_DOCUMENT_2               int64  
 97   FLAG_DOCUMENT_3               int64  
 98   FLAG_DOCUMENT_4               int64  
 99   FLAG_DOCUMENT_5               int64  
 100  FLAG_DOCUMENT_6               int64  
 101  FLAG_DOCUMENT_7               int64  
 102  FLAG_DOCUMENT_8               int64  
 103  FLAG_DOCUMENT_9               int64  
 104  FLAG_DOCUMENT_10              int64  
 105  FLAG_DOCUMENT_11              int64  
 106  FLAG_DOCUMENT_12              int64  
 107  FLAG_DOCUMENT_13              int64  
 108  FLAG_DOCUMENT_14              int64  
 109  FLAG_DOCUMENT_15              int64  
 110  FLAG_DOCUMENT_16              int64  
 111  FLAG_DOCUMENT_17              int64  
 112  FLAG_DOCUMENT_18              int64  
 113  FLAG_DOCUMENT_19              int64  
 114  FLAG_DOCUMENT_20              int64  
 115  FLAG_DOCUMENT_21              int64  
 116  AMT_REQ_CREDIT_BUREAU_HOUR    float64
 117  AMT_REQ_CREDIT_BUREAU_DAY     float64
 118  AMT_REQ_CREDIT_BUREAU_WEEK    float64
 119  AMT_REQ_CREDIT_BUREAU_MON     float64
 120  AMT_REQ_CREDIT_BUREAU_QRT     float64
 121  AMT_REQ_CREDIT_BUREAU_YEAR    float64
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB

We do not see any columns with Nullable values

In [6]:
df.shape
Out[6]:
(307511, 122)

There are ~307k rows and 122 columns

Get statistical summary for numerical variables¶

In [7]:
df.describe()
Out[7]:
SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
count 307511.00 307511.00 307511.00 307511.00 307511.00 307499.00 307233.00 307511.00 307511.00 307511.00 307511.00 307511.00 104582.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307509.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 134133.00 306851.00 246546.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 159080.00 306490.00 306490.00 306490.00 306490.00 307510.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 265992.00 265992.00 265992.00 265992.00 265992.00 265992.00
mean 278180.52 0.08 0.42 168797.92 599026.00 27108.57 538396.21 0.02 -16037.00 63815.05 -4986.12 -2994.20 12.06 1.00 0.82 0.20 1.00 0.28 0.06 2.15 2.05 2.03 12.06 0.02 0.05 0.04 0.08 0.23 0.18 0.50 0.51 0.51 0.12 0.09 0.98 0.75 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.11 0.09 0.98 0.76 0.04 0.07 0.15 0.22 0.23 0.06 0.11 0.11 0.01 0.03 0.12 0.09 0.98 0.76 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.10 1.42 0.14 1.41 0.10 -962.86 0.00 0.71 0.00 0.02 0.09 0.00 0.08 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.01 0.00 0.00 0.00 0.01 0.01 0.03 0.27 0.27 1.90
std 102790.18 0.27 0.72 237123.15 402490.78 14493.74 369446.46 0.01 4363.99 141275.77 3522.89 1509.45 11.94 0.00 0.38 0.40 0.04 0.45 0.23 0.91 0.51 0.50 3.27 0.12 0.22 0.20 0.27 0.42 0.38 0.21 0.19 0.19 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.14 0.16 0.08 0.09 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.07 0.13 0.10 0.14 0.16 0.08 0.10 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.15 0.16 0.08 0.09 0.11 0.05 0.07 0.11 2.40 0.45 2.38 0.36 826.81 0.01 0.45 0.01 0.12 0.28 0.01 0.27 0.06 0.00 0.06 0.00 0.06 0.05 0.03 0.10 0.02 0.09 0.02 0.02 0.02 0.08 0.11 0.20 0.92 0.79 1.87
min 100002.00 0.00 0.00 25650.00 45000.00 1615.50 40500.00 0.00 -25229.00 -17912.00 -24672.00 -7197.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -4292.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
25% 189145.50 0.00 0.00 112500.00 270000.00 16524.00 238500.00 0.01 -19682.00 -2760.00 -7479.50 -4299.00 5.00 1.00 1.00 0.00 1.00 0.00 0.00 2.00 2.00 2.00 10.00 0.00 0.00 0.00 0.00 0.00 0.00 0.33 0.39 0.37 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.05 0.04 0.98 0.70 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.04 0.00 0.00 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.04 0.00 0.00 0.00 0.00 -1570.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
50% 278202.00 0.00 0.00 147150.00 513531.00 24903.00 450000.00 0.02 -15750.00 -1213.00 -4504.00 -3254.00 9.00 1.00 1.00 0.00 1.00 0.00 0.00 2.00 2.00 2.00 12.00 0.00 0.00 0.00 0.00 0.00 0.00 0.51 0.57 0.54 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.08 0.07 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.07 0.00 0.00 0.00 0.00 -757.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
75% 367142.50 0.00 1.00 202500.00 808650.00 34596.00 679500.00 0.03 -12413.00 -289.00 -2010.00 -1720.00 15.00 1.00 1.00 0.00 1.00 1.00 0.00 3.00 2.00 2.00 14.00 0.00 0.00 0.00 0.00 0.00 0.00 0.68 0.66 0.67 0.15 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.14 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.08 0.13 0.13 0.00 0.02 0.15 0.11 0.99 0.83 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.13 2.00 0.00 2.00 0.00 -274.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00
max 456255.00 1.00 19.00 117000000.00 4050000.00 258025.50 4050000.00 0.07 -7489.00 365243.00 0.00 0.00 91.00 1.00 1.00 1.00 1.00 1.00 1.00 20.00 3.00 3.00 23.00 1.00 1.00 1.00 1.00 1.00 1.00 0.96 0.85 0.90 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 348.00 34.00 344.00 24.00 0.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 4.00 9.00 8.00 27.00 261.00 25.00

Analyzing categorical variables¶

In [8]:
df.select_dtypes(include = "object").columns
Out[8]:
Index(['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY',
       'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
       'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE',
       'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE',
       'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE'],
      dtype='object')
In [9]:
# Checking number of categorical variables
len(df.select_dtypes(include = "object").columns)
Out[9]:
16

There are 16 categorical variables

Analyzing numerical variables¶

In [10]:
df.select_dtypes(include=["int64","float64"]).columns
Out[10]:
Index(['SK_ID_CURR', 'TARGET', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
       'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED',
       ...
       'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20',
       'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR',
       'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
       'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT',
       'AMT_REQ_CREDIT_BUREAU_YEAR'],
      dtype='object', length=106)
In [11]:
# Checking number of categorical variables
len(df.select_dtypes(include=["int64","float64"]).columns)
Out[11]:
106

There are 106 numerical variables

In [12]:
df.select_dtypes(include=["int64","float64"])
Out[12]:
SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100002 1 0 202500.00 406597.50 24700.50 351000.00 0.02 -9461 -637 -3648.00 -2120 NaN 1 1 0 1 1 0 1.00 2 2 10 0 0 0 0 0 0 0.08 0.26 0.14 0.02 0.04 0.97 0.62 0.01 0.00 0.07 0.08 0.12 0.04 0.02 0.02 0.00 0.00 0.03 0.04 0.97 0.63 0.01 0.00 0.07 0.08 0.12 0.04 0.02 0.02 0.00 0.00 0.03 0.04 0.97 0.62 0.01 0.00 0.07 0.08 0.12 0.04 0.02 0.02 0.00 0.00 0.01 2.00 2.00 2.00 2.00 -1134.00 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 1.00
1 100003 0 0 270000.00 1293502.50 35698.50 1129500.00 0.00 -16765 -1188 -1186.00 -291 NaN 1 1 0 1 1 0 2.00 1 1 11 0 0 0 0 0 0 0.31 0.62 NaN 0.10 0.05 0.99 0.80 0.06 0.08 0.03 0.29 0.33 0.01 0.08 0.05 0.00 0.01 0.09 0.05 0.99 0.80 0.05 0.08 0.03 0.29 0.33 0.01 0.08 0.06 0.00 0.00 0.10 0.05 0.99 0.80 0.06 0.08 0.03 0.29 0.33 0.01 0.08 0.06 0.00 0.01 0.07 1.00 0.00 1.00 0.00 -828.00 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00
2 100004 0 0 67500.00 135000.00 6750.00 135000.00 0.01 -19046 -225 -4260.00 -2531 26.00 1 1 1 1 1 0 1.00 2 2 9 0 0 0 0 0 0 NaN 0.56 0.73 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.00 -815.00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00
3 100006 0 0 135000.00 312682.50 29686.50 297000.00 0.01 -19005 -3039 -9833.00 -2437 NaN 1 1 0 1 0 0 2.00 2 2 17 0 0 0 0 0 0 NaN 0.65 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.00 0.00 2.00 0.00 -617.00 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN
4 100007 0 0 121500.00 513000.00 21865.50 513000.00 0.03 -19932 -3038 -4311.00 -3458 NaN 1 1 0 1 0 0 1.00 2 2 11 0 0 0 0 1 1 NaN 0.32 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.00 -1106.00 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
307506 456251 0 0 157500.00 254700.00 27558.00 225000.00 0.03 -9327 -236 -8456.00 -1982 NaN 1 1 0 1 0 0 1.00 1 1 15 0 0 0 0 0 0 0.15 0.68 NaN 0.20 0.09 0.99 0.83 0.02 0.22 0.10 0.60 0.27 0.06 0.15 0.20 0.08 0.11 0.10 0.02 0.98 0.71 0.02 0.08 0.03 0.46 0.04 0.01 0.09 0.09 0.00 0.01 0.20 0.09 0.99 0.83 0.02 0.22 0.10 0.60 0.27 0.06 0.15 0.20 0.08 0.11 0.29 0.00 0.00 0.00 0.00 -273.00 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN
307507 456252 0 0 72000.00 269550.00 12001.50 225000.00 0.03 -20775 365243 -4388.00 -4090 NaN 1 0 0 1 1 0 1.00 2 2 8 0 0 0 0 0 0 NaN 0.12 NaN 0.02 0.04 0.97 0.63 0.00 0.00 0.10 0.08 0.12 0.06 0.02 0.03 0.00 0.00 0.03 0.05 0.97 0.64 0.00 0.00 0.10 0.08 0.12 0.06 0.02 0.03 0.00 0.00 0.03 0.04 0.97 0.63 0.00 0.00 0.10 0.08 0.12 0.06 0.02 0.03 0.00 0.00 0.02 0.00 0.00 0.00 0.00 0.00 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN
307508 456253 0 0 153000.00 677664.00 29979.00 585000.00 0.01 -14966 -7921 -6737.00 -5150 NaN 1 1 0 1 0 1 1.00 3 3 9 0 0 0 0 1 1 0.74 0.54 0.22 0.10 0.09 0.98 0.75 0.01 0.00 0.21 0.17 0.21 NaN 0.08 0.93 0.00 0.00 0.10 0.09 0.98 0.76 0.01 0.00 0.21 0.17 0.21 NaN 0.09 0.97 0.00 0.00 0.10 0.09 0.98 0.75 0.01 0.00 0.21 0.17 0.21 NaN 0.09 0.94 0.00 0.00 0.80 6.00 0.00 6.00 0.00 -1909.00 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.00 0.00 0.00 1.00 0.00 1.00
307509 456254 1 0 171000.00 370107.00 20205.00 319500.00 0.01 -11961 -4786 -2562.00 -931 NaN 1 1 0 1 0 0 2.00 2 2 9 0 0 0 1 1 0 NaN 0.51 0.66 0.01 NaN 0.98 NaN NaN NaN 0.07 0.04 NaN NaN NaN 0.01 NaN NaN 0.01 NaN 0.98 NaN NaN NaN 0.07 0.04 NaN NaN NaN 0.01 NaN NaN 0.01 NaN 0.98 NaN NaN NaN 0.07 0.04 NaN NaN NaN 0.01 NaN NaN 0.01 0.00 0.00 0.00 0.00 -322.00 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00
307510 456255 0 0 157500.00 675000.00 49117.50 675000.00 0.05 -16856 -1262 -5128.00 -410 NaN 1 1 1 1 1 0 2.00 1 1 20 0 0 0 0 1 1 0.73 0.71 0.11 0.07 0.05 0.99 NaN 0.02 0.08 0.07 0.38 NaN NaN NaN 0.08 NaN 0.00 0.08 0.05 0.99 NaN 0.02 0.08 0.07 0.38 NaN NaN NaN 0.08 NaN 0.00 0.07 0.05 0.99 NaN 0.02 0.08 0.07 0.38 NaN NaN NaN 0.08 NaN 0.00 0.07 0.00 0.00 0.00 0.00 -787.00 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 2.00 0.00 1.00

307511 rows × 106 columns

Dealing with incorrect data types¶

Check if we have any column with incorrect data type

In [13]:
df.dtypes
Out[13]:
SK_ID_CURR                        int64
TARGET                            int64
NAME_CONTRACT_TYPE               object
CODE_GENDER                      object
FLAG_OWN_CAR                     object
FLAG_OWN_REALTY                  object
CNT_CHILDREN                      int64
AMT_INCOME_TOTAL                float64
AMT_CREDIT                      float64
AMT_ANNUITY                     float64
AMT_GOODS_PRICE                 float64
NAME_TYPE_SUITE                  object
NAME_INCOME_TYPE                 object
NAME_EDUCATION_TYPE              object
NAME_FAMILY_STATUS               object
NAME_HOUSING_TYPE                object
REGION_POPULATION_RELATIVE      float64
DAYS_BIRTH                        int64
DAYS_EMPLOYED                     int64
DAYS_REGISTRATION               float64
DAYS_ID_PUBLISH                   int64
OWN_CAR_AGE                     float64
FLAG_MOBIL                        int64
FLAG_EMP_PHONE                    int64
FLAG_WORK_PHONE                   int64
FLAG_CONT_MOBILE                  int64
FLAG_PHONE                        int64
FLAG_EMAIL                        int64
OCCUPATION_TYPE                  object
CNT_FAM_MEMBERS                 float64
REGION_RATING_CLIENT              int64
REGION_RATING_CLIENT_W_CITY       int64
WEEKDAY_APPR_PROCESS_START       object
HOUR_APPR_PROCESS_START           int64
REG_REGION_NOT_LIVE_REGION        int64
REG_REGION_NOT_WORK_REGION        int64
LIVE_REGION_NOT_WORK_REGION       int64
REG_CITY_NOT_LIVE_CITY            int64
REG_CITY_NOT_WORK_CITY            int64
LIVE_CITY_NOT_WORK_CITY           int64
ORGANIZATION_TYPE                object
EXT_SOURCE_1                    float64
EXT_SOURCE_2                    float64
EXT_SOURCE_3                    float64
APARTMENTS_AVG                  float64
BASEMENTAREA_AVG                float64
YEARS_BEGINEXPLUATATION_AVG     float64
YEARS_BUILD_AVG                 float64
COMMONAREA_AVG                  float64
ELEVATORS_AVG                   float64
ENTRANCES_AVG                   float64
FLOORSMAX_AVG                   float64
FLOORSMIN_AVG                   float64
LANDAREA_AVG                    float64
LIVINGAPARTMENTS_AVG            float64
LIVINGAREA_AVG                  float64
NONLIVINGAPARTMENTS_AVG         float64
NONLIVINGAREA_AVG               float64
APARTMENTS_MODE                 float64
BASEMENTAREA_MODE               float64
YEARS_BEGINEXPLUATATION_MODE    float64
YEARS_BUILD_MODE                float64
COMMONAREA_MODE                 float64
ELEVATORS_MODE                  float64
ENTRANCES_MODE                  float64
FLOORSMAX_MODE                  float64
FLOORSMIN_MODE                  float64
LANDAREA_MODE                   float64
LIVINGAPARTMENTS_MODE           float64
LIVINGAREA_MODE                 float64
NONLIVINGAPARTMENTS_MODE        float64
NONLIVINGAREA_MODE              float64
APARTMENTS_MEDI                 float64
BASEMENTAREA_MEDI               float64
YEARS_BEGINEXPLUATATION_MEDI    float64
YEARS_BUILD_MEDI                float64
COMMONAREA_MEDI                 float64
ELEVATORS_MEDI                  float64
ENTRANCES_MEDI                  float64
FLOORSMAX_MEDI                  float64
FLOORSMIN_MEDI                  float64
LANDAREA_MEDI                   float64
LIVINGAPARTMENTS_MEDI           float64
LIVINGAREA_MEDI                 float64
NONLIVINGAPARTMENTS_MEDI        float64
NONLIVINGAREA_MEDI              float64
FONDKAPREMONT_MODE               object
HOUSETYPE_MODE                   object
TOTALAREA_MODE                  float64
WALLSMATERIAL_MODE               object
EMERGENCYSTATE_MODE              object
OBS_30_CNT_SOCIAL_CIRCLE        float64
DEF_30_CNT_SOCIAL_CIRCLE        float64
OBS_60_CNT_SOCIAL_CIRCLE        float64
DEF_60_CNT_SOCIAL_CIRCLE        float64
DAYS_LAST_PHONE_CHANGE          float64
FLAG_DOCUMENT_2                   int64
FLAG_DOCUMENT_3                   int64
FLAG_DOCUMENT_4                   int64
FLAG_DOCUMENT_5                   int64
FLAG_DOCUMENT_6                   int64
FLAG_DOCUMENT_7                   int64
FLAG_DOCUMENT_8                   int64
FLAG_DOCUMENT_9                   int64
FLAG_DOCUMENT_10                  int64
FLAG_DOCUMENT_11                  int64
FLAG_DOCUMENT_12                  int64
FLAG_DOCUMENT_13                  int64
FLAG_DOCUMENT_14                  int64
FLAG_DOCUMENT_15                  int64
FLAG_DOCUMENT_16                  int64
FLAG_DOCUMENT_17                  int64
FLAG_DOCUMENT_18                  int64
FLAG_DOCUMENT_19                  int64
FLAG_DOCUMENT_20                  int64
FLAG_DOCUMENT_21                  int64
AMT_REQ_CREDIT_BUREAU_HOUR      float64
AMT_REQ_CREDIT_BUREAU_DAY       float64
AMT_REQ_CREDIT_BUREAU_WEEK      float64
AMT_REQ_CREDIT_BUREAU_MON       float64
AMT_REQ_CREDIT_BUREAU_QRT       float64
AMT_REQ_CREDIT_BUREAU_YEAR      float64
dtype: object
In [14]:
df.head()
Out[14]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100002 1 Cash loans M N Y 0 202500.00 406597.50 24700.50 351000.00 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.02 -9461 -637 -3648.00 -2120 NaN 1 1 0 1 1 0 Laborers 1.00 2 2 WEDNESDAY 10 0 0 0 0 0 0 Business Entity Type 3 0.08 0.26 0.14 0.02 0.04 0.97 0.62 0.01 0.00 0.07 0.08 0.12 0.04 0.02 0.02 0.00 0.00 0.03 0.04 0.97 0.63 0.01 0.00 0.07 0.08 0.12 0.04 0.02 0.02 0.00 0.00 0.03 0.04 0.97 0.62 0.01 0.00 0.07 0.08 0.12 0.04 0.02 0.02 0.00 0.00 reg oper account block of flats 0.01 Stone, brick No 2.00 2.00 2.00 2.00 -1134.00 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 1.00
1 100003 0 Cash loans F N N 0 270000.00 1293502.50 35698.50 1129500.00 Family State servant Higher education Married House / apartment 0.00 -16765 -1188 -1186.00 -291 NaN 1 1 0 1 1 0 Core staff 2.00 1 1 MONDAY 11 0 0 0 0 0 0 School 0.31 0.62 NaN 0.10 0.05 0.99 0.80 0.06 0.08 0.03 0.29 0.33 0.01 0.08 0.05 0.00 0.01 0.09 0.05 0.99 0.80 0.05 0.08 0.03 0.29 0.33 0.01 0.08 0.06 0.00 0.00 0.10 0.05 0.99 0.80 0.06 0.08 0.03 0.29 0.33 0.01 0.08 0.06 0.00 0.01 reg oper account block of flats 0.07 Block No 1.00 0.00 1.00 0.00 -828.00 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00
2 100004 0 Revolving loans M Y Y 0 67500.00 135000.00 6750.00 135000.00 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.01 -19046 -225 -4260.00 -2531 26.00 1 1 1 1 1 0 Laborers 1.00 2 2 MONDAY 9 0 0 0 0 0 0 Government NaN 0.56 0.73 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.00 -815.00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00
3 100006 0 Cash loans F N Y 0 135000.00 312682.50 29686.50 297000.00 Unaccompanied Working Secondary / secondary special Civil marriage House / apartment 0.01 -19005 -3039 -9833.00 -2437 NaN 1 1 0 1 0 0 Laborers 2.00 2 2 WEDNESDAY 17 0 0 0 0 0 0 Business Entity Type 3 NaN 0.65 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.00 0.00 2.00 0.00 -617.00 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.00 513000.00 21865.50 513000.00 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.03 -19932 -3038 -4311.00 -3458 NaN 1 1 0 1 0 0 Core staff 1.00 2 2 THURSDAY 11 0 0 0 0 1 1 Religion NaN 0.32 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.00 -1106.00 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00

Looking at the data and their corresponding data types, we can conclude that No Data Type changes are required

Dealing with missing values¶

Check if we have any null values in the dataset

In [15]:
df.isnull().values.any()
Out[15]:
True

Get Total number of null values in the dataset

In [16]:
df.isnull().values.sum()
Out[16]:
9152465

Getting the list of column(s) which have null values

In [17]:
df.columns[df.isnull().any()]
Out[17]:
Index(['AMT_ANNUITY', 'AMT_GOODS_PRICE', 'NAME_TYPE_SUITE', 'OWN_CAR_AGE',
       'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS', 'EXT_SOURCE_1', 'EXT_SOURCE_2',
       'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG',
       'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG',
       'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG',
       'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG',
       'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE',
       'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE',
       'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE',
       'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE',
       'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE',
       'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI',
       'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI',
       'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI',
       'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI',
       'NONLIVINGAREA_MEDI', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE',
       'TOTALAREA_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE',
       'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
       'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE',
       'DAYS_LAST_PHONE_CHANGE', 'AMT_REQ_CREDIT_BUREAU_HOUR',
       'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
       'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT',
       'AMT_REQ_CREDIT_BUREAU_YEAR'],
      dtype='object')
In [18]:
len(df.columns[df.isnull().any()])
Out[18]:
67

There are totally 67 columns having one or more NULL values in the data

Computing count and percentage of missing values¶

In [19]:
null_count = df.isnull().sum()
null_percentage = round((df.isnull().sum()/df.shape[0])*100, 2)
In [20]:
null_df = pd.DataFrame({'column_name' : df.columns,'null_count' : null_count,'null_percentage': null_percentage})
null_df.reset_index(drop = True, inplace = True)
In [21]:
null_df.sort_values(by = 'null_percentage', ascending = False)
Out[21]:
column_name null_count null_percentage
76 COMMONAREA_MEDI 214865 69.87
48 COMMONAREA_AVG 214865 69.87
62 COMMONAREA_MODE 214865 69.87
70 NONLIVINGAPARTMENTS_MODE 213514 69.43
56 NONLIVINGAPARTMENTS_AVG 213514 69.43
84 NONLIVINGAPARTMENTS_MEDI 213514 69.43
86 FONDKAPREMONT_MODE 210295 68.39
68 LIVINGAPARTMENTS_MODE 210199 68.35
54 LIVINGAPARTMENTS_AVG 210199 68.35
82 LIVINGAPARTMENTS_MEDI 210199 68.35
52 FLOORSMIN_AVG 208642 67.85
66 FLOORSMIN_MODE 208642 67.85
80 FLOORSMIN_MEDI 208642 67.85
75 YEARS_BUILD_MEDI 204488 66.50
61 YEARS_BUILD_MODE 204488 66.50
47 YEARS_BUILD_AVG 204488 66.50
21 OWN_CAR_AGE 202929 65.99
81 LANDAREA_MEDI 182590 59.38
67 LANDAREA_MODE 182590 59.38
53 LANDAREA_AVG 182590 59.38
73 BASEMENTAREA_MEDI 179943 58.52
45 BASEMENTAREA_AVG 179943 58.52
59 BASEMENTAREA_MODE 179943 58.52
41 EXT_SOURCE_1 173378 56.38
71 NONLIVINGAREA_MODE 169682 55.18
57 NONLIVINGAREA_AVG 169682 55.18
85 NONLIVINGAREA_MEDI 169682 55.18
77 ELEVATORS_MEDI 163891 53.30
49 ELEVATORS_AVG 163891 53.30
63 ELEVATORS_MODE 163891 53.30
89 WALLSMATERIAL_MODE 156341 50.84
72 APARTMENTS_MEDI 156061 50.75
44 APARTMENTS_AVG 156061 50.75
58 APARTMENTS_MODE 156061 50.75
78 ENTRANCES_MEDI 154828 50.35
50 ENTRANCES_AVG 154828 50.35
64 ENTRANCES_MODE 154828 50.35
55 LIVINGAREA_AVG 154350 50.19
69 LIVINGAREA_MODE 154350 50.19
83 LIVINGAREA_MEDI 154350 50.19
87 HOUSETYPE_MODE 154297 50.18
65 FLOORSMAX_MODE 153020 49.76
79 FLOORSMAX_MEDI 153020 49.76
51 FLOORSMAX_AVG 153020 49.76
60 YEARS_BEGINEXPLUATATION_MODE 150007 48.78
74 YEARS_BEGINEXPLUATATION_MEDI 150007 48.78
46 YEARS_BEGINEXPLUATATION_AVG 150007 48.78
88 TOTALAREA_MODE 148431 48.27
90 EMERGENCYSTATE_MODE 145755 47.40
28 OCCUPATION_TYPE 96391 31.35
43 EXT_SOURCE_3 60965 19.83
116 AMT_REQ_CREDIT_BUREAU_HOUR 41519 13.50
117 AMT_REQ_CREDIT_BUREAU_DAY 41519 13.50
118 AMT_REQ_CREDIT_BUREAU_WEEK 41519 13.50
119 AMT_REQ_CREDIT_BUREAU_MON 41519 13.50
120 AMT_REQ_CREDIT_BUREAU_QRT 41519 13.50
121 AMT_REQ_CREDIT_BUREAU_YEAR 41519 13.50
11 NAME_TYPE_SUITE 1292 0.42
92 DEF_30_CNT_SOCIAL_CIRCLE 1021 0.33
91 OBS_30_CNT_SOCIAL_CIRCLE 1021 0.33
93 OBS_60_CNT_SOCIAL_CIRCLE 1021 0.33
94 DEF_60_CNT_SOCIAL_CIRCLE 1021 0.33
42 EXT_SOURCE_2 660 0.21
10 AMT_GOODS_PRICE 278 0.09
6 CNT_CHILDREN 0 0.00
102 FLAG_DOCUMENT_8 0 0.00
2 NAME_CONTRACT_TYPE 0 0.00
3 CODE_GENDER 0 0.00
4 FLAG_OWN_CAR 0 0.00
95 DAYS_LAST_PHONE_CHANGE 1 0.00
96 FLAG_DOCUMENT_2 0 0.00
97 FLAG_DOCUMENT_3 0 0.00
98 FLAG_DOCUMENT_4 0 0.00
99 FLAG_DOCUMENT_5 0 0.00
100 FLAG_DOCUMENT_6 0 0.00
101 FLAG_DOCUMENT_7 0 0.00
103 FLAG_DOCUMENT_9 0 0.00
115 FLAG_DOCUMENT_21 0 0.00
104 FLAG_DOCUMENT_10 0 0.00
105 FLAG_DOCUMENT_11 0 0.00
5 FLAG_OWN_REALTY 0 0.00
107 FLAG_DOCUMENT_13 0 0.00
108 FLAG_DOCUMENT_14 0 0.00
109 FLAG_DOCUMENT_15 0 0.00
110 FLAG_DOCUMENT_16 0 0.00
111 FLAG_DOCUMENT_17 0 0.00
112 FLAG_DOCUMENT_18 0 0.00
113 FLAG_DOCUMENT_19 0 0.00
114 FLAG_DOCUMENT_20 0 0.00
106 FLAG_DOCUMENT_12 0 0.00
8 AMT_CREDIT 0 0.00
7 AMT_INCOME_TOTAL 0 0.00
26 FLAG_PHONE 0 0.00
39 LIVE_CITY_NOT_WORK_CITY 0 0.00
38 REG_CITY_NOT_WORK_CITY 0 0.00
1 TARGET 0 0.00
37 REG_CITY_NOT_LIVE_CITY 0 0.00
36 LIVE_REGION_NOT_WORK_REGION 0 0.00
35 REG_REGION_NOT_WORK_REGION 0 0.00
34 REG_REGION_NOT_LIVE_REGION 0 0.00
33 HOUR_APPR_PROCESS_START 0 0.00
32 WEEKDAY_APPR_PROCESS_START 0 0.00
31 REGION_RATING_CLIENT_W_CITY 0 0.00
30 REGION_RATING_CLIENT 0 0.00
29 CNT_FAM_MEMBERS 2 0.00
27 FLAG_EMAIL 0 0.00
25 FLAG_CONT_MOBILE 0 0.00
40 ORGANIZATION_TYPE 0 0.00
24 FLAG_WORK_PHONE 0 0.00
23 FLAG_EMP_PHONE 0 0.00
22 FLAG_MOBIL 0 0.00
20 DAYS_ID_PUBLISH 0 0.00
19 DAYS_REGISTRATION 0 0.00
18 DAYS_EMPLOYED 0 0.00
17 DAYS_BIRTH 0 0.00
16 REGION_POPULATION_RELATIVE 0 0.00
15 NAME_HOUSING_TYPE 0 0.00
14 NAME_FAMILY_STATUS 0 0.00
13 NAME_EDUCATION_TYPE 0 0.00
12 NAME_INCOME_TYPE 0 0.00
9 AMT_ANNUITY 12 0.00
0 SK_ID_CURR 0 0.00

Removing columns with NULL values > 40%¶

Getting list of columns with NULL values > 40% into a list. We will be removing these columns from the dataframe as there are too many missing values.

In [22]:
columns_to_be_deleted = null_df[null_df['null_percentage'] > 40].column_name.to_list()
In [23]:
columns_to_be_deleted
Out[23]:
['OWN_CAR_AGE',
 'EXT_SOURCE_1',
 'APARTMENTS_AVG',
 'BASEMENTAREA_AVG',
 'YEARS_BEGINEXPLUATATION_AVG',
 'YEARS_BUILD_AVG',
 'COMMONAREA_AVG',
 'ELEVATORS_AVG',
 'ENTRANCES_AVG',
 'FLOORSMAX_AVG',
 'FLOORSMIN_AVG',
 'LANDAREA_AVG',
 'LIVINGAPARTMENTS_AVG',
 'LIVINGAREA_AVG',
 'NONLIVINGAPARTMENTS_AVG',
 'NONLIVINGAREA_AVG',
 'APARTMENTS_MODE',
 'BASEMENTAREA_MODE',
 'YEARS_BEGINEXPLUATATION_MODE',
 'YEARS_BUILD_MODE',
 'COMMONAREA_MODE',
 'ELEVATORS_MODE',
 'ENTRANCES_MODE',
 'FLOORSMAX_MODE',
 'FLOORSMIN_MODE',
 'LANDAREA_MODE',
 'LIVINGAPARTMENTS_MODE',
 'LIVINGAREA_MODE',
 'NONLIVINGAPARTMENTS_MODE',
 'NONLIVINGAREA_MODE',
 'APARTMENTS_MEDI',
 'BASEMENTAREA_MEDI',
 'YEARS_BEGINEXPLUATATION_MEDI',
 'YEARS_BUILD_MEDI',
 'COMMONAREA_MEDI',
 'ELEVATORS_MEDI',
 'ENTRANCES_MEDI',
 'FLOORSMAX_MEDI',
 'FLOORSMIN_MEDI',
 'LANDAREA_MEDI',
 'LIVINGAPARTMENTS_MEDI',
 'LIVINGAREA_MEDI',
 'NONLIVINGAPARTMENTS_MEDI',
 'NONLIVINGAREA_MEDI',
 'FONDKAPREMONT_MODE',
 'HOUSETYPE_MODE',
 'TOTALAREA_MODE',
 'WALLSMATERIAL_MODE',
 'EMERGENCYSTATE_MODE']
In [24]:
len(columns_to_be_deleted)
Out[24]:
49

There are totally 49 columns to be removed. Deleting them from main dataframe df

In [25]:
df.drop(columns = columns_to_be_deleted, inplace = True)

Checking column count post removal. Only 73 columns should be left

In [26]:
df.shape
Out[26]:
(307511, 73)

Checking columns with NULL values < 40%¶

Creating dataframe null_df_under40 with missing column percentages under 40%

In [27]:
null_df_under40 = null_df[null_df['null_percentage'] < 40]
In [28]:
null_df_under40.sort_values(by = 'null_percentage', ascending = False)
Out[28]:
column_name null_count null_percentage
28 OCCUPATION_TYPE 96391 31.35
43 EXT_SOURCE_3 60965 19.83
121 AMT_REQ_CREDIT_BUREAU_YEAR 41519 13.50
120 AMT_REQ_CREDIT_BUREAU_QRT 41519 13.50
119 AMT_REQ_CREDIT_BUREAU_MON 41519 13.50
118 AMT_REQ_CREDIT_BUREAU_WEEK 41519 13.50
117 AMT_REQ_CREDIT_BUREAU_DAY 41519 13.50
116 AMT_REQ_CREDIT_BUREAU_HOUR 41519 13.50
11 NAME_TYPE_SUITE 1292 0.42
91 OBS_30_CNT_SOCIAL_CIRCLE 1021 0.33
92 DEF_30_CNT_SOCIAL_CIRCLE 1021 0.33
93 OBS_60_CNT_SOCIAL_CIRCLE 1021 0.33
94 DEF_60_CNT_SOCIAL_CIRCLE 1021 0.33
42 EXT_SOURCE_2 660 0.21
10 AMT_GOODS_PRICE 278 0.09
101 FLAG_DOCUMENT_7 0 0.00
95 DAYS_LAST_PHONE_CHANGE 1 0.00
96 FLAG_DOCUMENT_2 0 0.00
97 FLAG_DOCUMENT_3 0 0.00
98 FLAG_DOCUMENT_4 0 0.00
99 FLAG_DOCUMENT_5 0 0.00
100 FLAG_DOCUMENT_6 0 0.00
105 FLAG_DOCUMENT_11 0 0.00
102 FLAG_DOCUMENT_8 0 0.00
103 FLAG_DOCUMENT_9 0 0.00
104 FLAG_DOCUMENT_10 0 0.00
39 LIVE_CITY_NOT_WORK_CITY 0 0.00
106 FLAG_DOCUMENT_12 0 0.00
107 FLAG_DOCUMENT_13 0 0.00
108 FLAG_DOCUMENT_14 0 0.00
109 FLAG_DOCUMENT_15 0 0.00
110 FLAG_DOCUMENT_16 0 0.00
111 FLAG_DOCUMENT_17 0 0.00
112 FLAG_DOCUMENT_18 0 0.00
113 FLAG_DOCUMENT_19 0 0.00
114 FLAG_DOCUMENT_20 0 0.00
115 FLAG_DOCUMENT_21 0 0.00
40 ORGANIZATION_TYPE 0 0.00
0 SK_ID_CURR 0 0.00
38 REG_CITY_NOT_WORK_CITY 0 0.00
9 AMT_ANNUITY 12 0.00
17 DAYS_BIRTH 0 0.00
16 REGION_POPULATION_RELATIVE 0 0.00
15 NAME_HOUSING_TYPE 0 0.00
14 NAME_FAMILY_STATUS 0 0.00
13 NAME_EDUCATION_TYPE 0 0.00
12 NAME_INCOME_TYPE 0 0.00
8 AMT_CREDIT 0 0.00
1 TARGET 0 0.00
7 AMT_INCOME_TOTAL 0 0.00
6 CNT_CHILDREN 0 0.00
5 FLAG_OWN_REALTY 0 0.00
4 FLAG_OWN_CAR 0 0.00
3 CODE_GENDER 0 0.00
2 NAME_CONTRACT_TYPE 0 0.00
18 DAYS_EMPLOYED 0 0.00
19 DAYS_REGISTRATION 0 0.00
20 DAYS_ID_PUBLISH 0 0.00
22 FLAG_MOBIL 0 0.00
23 FLAG_EMP_PHONE 0 0.00
24 FLAG_WORK_PHONE 0 0.00
25 FLAG_CONT_MOBILE 0 0.00
26 FLAG_PHONE 0 0.00
27 FLAG_EMAIL 0 0.00
29 CNT_FAM_MEMBERS 2 0.00
30 REGION_RATING_CLIENT 0 0.00
31 REGION_RATING_CLIENT_W_CITY 0 0.00
32 WEEKDAY_APPR_PROCESS_START 0 0.00
33 HOUR_APPR_PROCESS_START 0 0.00
34 REG_REGION_NOT_LIVE_REGION 0 0.00
35 REG_REGION_NOT_WORK_REGION 0 0.00
36 LIVE_REGION_NOT_WORK_REGION 0 0.00
37 REG_CITY_NOT_LIVE_CITY 0 0.00

Analysis of OCCUPATION_TYPE column¶

  • nullable values = 31.35%
In [29]:
df['OCCUPATION_TYPE'].value_counts()
Out[29]:
Laborers                 55186
Sales staff              32102
Core staff               27570
Managers                 21371
Drivers                  18603
High skill tech staff    11380
Accountants               9813
Medicine staff            8537
Security staff            6721
Cooking staff             5946
Cleaning staff            4653
Private service staff     2652
Low-skill Laborers        2093
Waiters/barmen staff      1348
Secretaries               1305
Realty agents              751
HR staff                   563
IT staff                   526
Name: OCCUPATION_TYPE, dtype: int64

Impute NULL values with Unknown category

In [30]:
df['OCCUPATION_TYPE'].fillna(value = 'Unknown', inplace = True)
In [31]:
plt.figure(figsize = (10,5))
sns.countplot(data = df, x = "OCCUPATION_TYPE")
plt.xticks(rotation = 90)
plt.show()
No description has been provided for this image

Observations

  • Looking at the plot, Laborers has the highest number of loan applicants
  • For imputation, it would be better to leave the data as is (missing values being 31.35%) and not impute to min/min/mode/median as it may bias the data in later computations

Analysis of EXT_SOURCE_3 column¶

  • nullable values = 19.83%
In [32]:
df.EXT_SOURCE_3.value_counts().head()
Out[32]:
0.75    1460
0.71    1315
0.69    1276
0.67    1191
0.65    1154
Name: EXT_SOURCE_3, dtype: int64
In [33]:
sns.boxplot(df.EXT_SOURCE_3)
plt.show()
No description has been provided for this image

Getting percentile values for EXT_SOURCE_3

In [34]:
df.EXT_SOURCE_3.quantile(q = [0.25,0.5,0.75,1])
Out[34]:
0.25   0.37
0.50   0.54
0.75   0.67
1.00   0.90
Name: EXT_SOURCE_3, dtype: float64

Most recurring value in EXT_SOURCE_3

In [35]:
df.EXT_SOURCE_3.mode()[0]
Out[35]:
0.746300213050371

Checking the average value of EXT_SOURCE_3

In [36]:
df.EXT_SOURCE_3.mean()
Out[36]:
0.5108529061800121

Observations

  • Looking at the boxplot, median is 0.535276
  • Most recurring value is 0.746300213050371
  • Mean value is 0.5108529061800121
  • Though mean and median are closer and could be used for imputation, since missing percentage value is higher (19.83%), it would be better to leave the data as it is and not perform imputations

Analysis of six columns with 13.5% missing values¶

  • AMT_REQ_CREDIT_BUREAU_YEAR
  • AMT_REQ_CREDIT_BUREAU_QRT
  • AMT_REQ_CREDIT_BUREAU_MON
  • AMT_REQ_CREDIT_BUREAU_WEEK
  • AMT_REQ_CREDIT_BUREAU_DAY
  • AMT_REQ_CREDIT_BUREAU_HOUR

nullable values = 13.50%

Looking at summary statistics for the columns

In [37]:
df[['AMT_REQ_CREDIT_BUREAU_YEAR',
    'AMT_REQ_CREDIT_BUREAU_QRT',
    'AMT_REQ_CREDIT_BUREAU_MON',
    'AMT_REQ_CREDIT_BUREAU_WEEK',
    'AMT_REQ_CREDIT_BUREAU_DAY',
    'AMT_REQ_CREDIT_BUREAU_HOUR']].describe()
Out[37]:
AMT_REQ_CREDIT_BUREAU_YEAR AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_HOUR
count 265992.00 265992.00 265992.00 265992.00 265992.00 265992.00
mean 1.90 0.27 0.27 0.03 0.01 0.01
std 1.87 0.79 0.92 0.20 0.11 0.08
min 0.00 0.00 0.00 0.00 0.00 0.00
25% 0.00 0.00 0.00 0.00 0.00 0.00
50% 1.00 0.00 0.00 0.00 0.00 0.00
75% 3.00 0.00 0.00 0.00 0.00 0.00
max 25.00 261.00 27.00 8.00 9.00 4.00

Most recurring value for the columns

In [38]:
df[['AMT_REQ_CREDIT_BUREAU_YEAR',
    'AMT_REQ_CREDIT_BUREAU_QRT',
    'AMT_REQ_CREDIT_BUREAU_MON',
    'AMT_REQ_CREDIT_BUREAU_WEEK',
    'AMT_REQ_CREDIT_BUREAU_DAY',
    'AMT_REQ_CREDIT_BUREAU_HOUR']].mode()
Out[38]:
AMT_REQ_CREDIT_BUREAU_YEAR AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_HOUR
0 0.00 0.00 0.00 0.00 0.00 0.00

Observations for the below columns

  • AMT_REQ_CREDIT_BUREAU_YEAR
  • Mean is 1.899974 and Median is 1.000000
  • Mode is 0
  • Don't impute as nullable values are higher (13.50%) and it might introduce bias in the data
  • AMT_REQ_CREDIT_BUREAU_QRT
  • Mean is 0.265474 and Median is 0
  • Mode is 0
  • Don't impute as nullable values are higher (13.50%) and it might introduce bias in the data
  • AMT_REQ_CREDIT_BUREAU_MON
  • Mean is 0.267395 and Median is 0
  • Mode is 0
  • Don't impute as nullable values are higher (13.50%) and it might introduce bias in the data
  • AMT_REQ_CREDIT_BUREAU_WEEK
  • Mean is 0.034362 and Median is 0
  • Mode is 0
  • Don't impute as nullable values are higher (13.50%) and it might introduce bias in the data
  • AMT_REQ_CREDIT_BUREAU_DAY
  • Mean is 0.007000 and Median is 0
  • Mode is 0
  • Don't impute as nullable values are higher (13.50%) and it might introduce bias in the data
  • AMT_REQ_CREDIT_BUREAU_HOUR
  • Mean is 0.006402 and Median is 0
  • Mode is 0
  • Don't impute as nullable values are higher (13.50%) and it might introduce bias in the data

Checking columns with NULL values > 0% and < 1%¶

Creating dataframe null_df_under1 with missing column percentages values > 0% and < 1%

In [39]:
null_df_under1 = null_df[(null_df['null_percentage'] > 0) & (null_df['null_percentage'] < 1)]
In [40]:
null_df_under1.sort_values(by = 'null_percentage', ascending = False)
Out[40]:
column_name null_count null_percentage
11 NAME_TYPE_SUITE 1292 0.42
91 OBS_30_CNT_SOCIAL_CIRCLE 1021 0.33
92 DEF_30_CNT_SOCIAL_CIRCLE 1021 0.33
93 OBS_60_CNT_SOCIAL_CIRCLE 1021 0.33
94 DEF_60_CNT_SOCIAL_CIRCLE 1021 0.33
42 EXT_SOURCE_2 660 0.21
10 AMT_GOODS_PRICE 278 0.09

Analysis of NAME_TYPE_SUITE column¶

  • nullable values = 0.42%
In [41]:
df['NAME_TYPE_SUITE'].value_counts()
Out[41]:
Unaccompanied      248526
Family              40149
Spouse, partner     11370
Children             3267
Other_B              1770
Other_A               866
Group of people       271
Name: NAME_TYPE_SUITE, dtype: int64
In [42]:
plt.figure(figsize = (10,5))
sns.countplot(data = df, x = "NAME_TYPE_SUITE")
plt.xticks(rotation = 90)
plt.show()
No description has been provided for this image

Observations

  • Looking at the plot, Unaccompanied category has the highest number of loan applicants. So, most of the loan applicants venture out alone for applying loan
  • We can go ahead and impute Unaccompanied in the dataframe

Analysis of OBS_30_CNT_SOCIAL_CIRCLE column¶

  • nullable values = 0.33%
In [43]:
df.OBS_30_CNT_SOCIAL_CIRCLE.value_counts().head()
Out[43]:
0.00    163910
1.00     48783
2.00     29808
3.00     20322
4.00     14143
Name: OBS_30_CNT_SOCIAL_CIRCLE, dtype: int64
In [44]:
sns.boxplot(df.OBS_30_CNT_SOCIAL_CIRCLE)
plt.show()
No description has been provided for this image

Getting percentile values for OBS_30_CNT_SOCIAL_CIRCLE

In [45]:
df.OBS_30_CNT_SOCIAL_CIRCLE.quantile(q = [0.25,0.5,0.75,1])
Out[45]:
0.25     0.00
0.50     0.00
0.75     2.00
1.00   348.00
Name: OBS_30_CNT_SOCIAL_CIRCLE, dtype: float64

Most recurring value in OBS_30_CNT_SOCIAL_CIRCLE

In [46]:
df.OBS_30_CNT_SOCIAL_CIRCLE.mode()[0]
Out[46]:
0.0

Checking the average value of OBS_30_CNT_SOCIAL_CIRCLE

In [47]:
df.OBS_30_CNT_SOCIAL_CIRCLE.mean()
Out[47]:
1.4222454239942575

Observations

  • Looking at the boxplot, median is 0.0
  • Most recurring value is 0.0
  • Mean value is 1.4222454239942575
  • There are two outlier values at 50 and 350.
  • Mean and mode are closer and can be used for imputation. It will not introduce bias as the missing value percentage is small (0.33%)

Analysis of DEF_30_CNT_SOCIAL_CIRCLE column¶

  • nullable values = 0.33%
In [48]:
df.DEF_30_CNT_SOCIAL_CIRCLE.value_counts().head()
Out[48]:
0.00    271324
1.00     28328
2.00      5323
3.00      1192
4.00       253
Name: DEF_30_CNT_SOCIAL_CIRCLE, dtype: int64
In [49]:
sns.boxplot(df.DEF_30_CNT_SOCIAL_CIRCLE)
plt.show()
No description has been provided for this image

Getting percentile values for DEF_30_CNT_SOCIAL_CIRCLE

In [50]:
df.DEF_30_CNT_SOCIAL_CIRCLE.quantile(q = [0.25,0.5,0.75,0.99,1])
Out[50]:
0.25    0.00
0.50    0.00
0.75    0.00
0.99    2.00
1.00   34.00
Name: DEF_30_CNT_SOCIAL_CIRCLE, dtype: float64

Most recurring value in DEF_30_CNT_SOCIAL_CIRCLE

In [51]:
df.DEF_30_CNT_SOCIAL_CIRCLE.mode()[0]
Out[51]:
0.0

Checking the average value of DEF_30_CNT_SOCIAL_CIRCLE

In [52]:
df.DEF_30_CNT_SOCIAL_CIRCLE.mean()
Out[52]:
0.1434206662533851

Observations

  • Looking at the boxplot, median is 0.0
  • Most recurring value is 0.0
  • Mean value is 0.1434206662533851
  • Even 99th percentile value is 2. There are ~7 outliers the largest of which is ~33.
  • Mean and median are closer and can be used for imputation. It will not introduce bias as the missing value percentage is small (0.33%)

Analysis of OBS_60_CNT_SOCIAL_CIRCLE column¶

  • nullable values = 0.33%
In [53]:
df.OBS_60_CNT_SOCIAL_CIRCLE.value_counts().head()
Out[53]:
0.00    164666
1.00     48870
2.00     29766
3.00     20215
4.00     13946
Name: OBS_60_CNT_SOCIAL_CIRCLE, dtype: int64
In [54]:
sns.boxplot(df.OBS_60_CNT_SOCIAL_CIRCLE)
plt.show()
No description has been provided for this image

Getting percentile values for OBS_60_CNT_SOCIAL_CIRCLE

In [55]:
df.OBS_60_CNT_SOCIAL_CIRCLE.quantile(q = [0.25,0.5,0.75,0.99,1])
Out[55]:
0.25     0.00
0.50     0.00
0.75     2.00
0.99    10.00
1.00   344.00
Name: OBS_60_CNT_SOCIAL_CIRCLE, dtype: float64

Most recurring value in OBS_60_CNT_SOCIAL_CIRCLE

In [56]:
df.OBS_60_CNT_SOCIAL_CIRCLE.mode()[0]
Out[56]:
0.0

Checking the average value of OBS_60_CNT_SOCIAL_CIRCLE

In [57]:
df.OBS_60_CNT_SOCIAL_CIRCLE.mean()
Out[57]:
1.4052921791901856

Observations

  • Looking at the boxplot, median is 0.0
  • Most recurring value is 0.0
  • Mean value is 1.4052921791901856
  • Even 99th percentile value is 10. There is a prominent outlier at 50 and 350 approximately.
  • Mean and median are closer and can be used for imputation. It will not introduce bias as the missing value percentage is small (0.33%)

Analysis of DEF_60_CNT_SOCIAL_CIRCLE column¶

  • nullable values = 0.33%
In [58]:
df.DEF_60_CNT_SOCIAL_CIRCLE.value_counts().head()
Out[58]:
0.00    280721
1.00     21841
2.00      3170
3.00       598
4.00       135
Name: DEF_60_CNT_SOCIAL_CIRCLE, dtype: int64
In [59]:
sns.boxplot(df.DEF_60_CNT_SOCIAL_CIRCLE)
plt.show()
No description has been provided for this image

Getting percentile values for DEF_60_CNT_SOCIAL_CIRCLE

In [60]:
df.DEF_60_CNT_SOCIAL_CIRCLE.quantile(q = [0.25,0.5,0.75,0.99,1])
Out[60]:
0.25    0.00
0.50    0.00
0.75    0.00
0.99    2.00
1.00   24.00
Name: DEF_60_CNT_SOCIAL_CIRCLE, dtype: float64

Most recurring value in DEF_60_CNT_SOCIAL_CIRCLE

In [61]:
df.DEF_60_CNT_SOCIAL_CIRCLE.mode()[0]
Out[61]:
0.0

Checking the average value of DEF_60_CNT_SOCIAL_CIRCLE

In [62]:
df.DEF_60_CNT_SOCIAL_CIRCLE.mean()
Out[62]:
0.10004894123788705

Observations

  • Looking at the boxplot, median is 0.0
  • Most recurring value is 0.0
  • Mean value is 0.10004894123788705
  • Even 99th percentile value is 2. There are ~7 outliers the largest of which is ~24.
  • Mean and median are closer and can be used for imputation. It will not introduce bias as the missing value percentage is small (0.33%)

Analysis of EXT_SOURCE_2 column¶

  • nullable values = 0.21%
In [63]:
df.EXT_SOURCE_2.value_counts().head()
Out[63]:
0.29    721
0.26    417
0.27    343
0.16    322
0.27    306
Name: EXT_SOURCE_2, dtype: int64
In [64]:
sns.boxplot(df.EXT_SOURCE_2)
plt.show()
No description has been provided for this image

Getting percentile values for EXT_SOURCE_2

In [65]:
df.EXT_SOURCE_2.quantile(q = [0.25,0.5,0.75,0.99,1])
Out[65]:
0.25   0.39
0.50   0.57
0.75   0.66
0.99   0.78
1.00   0.85
Name: EXT_SOURCE_2, dtype: float64

Most recurring value in EXT_SOURCE_2

In [66]:
df.EXT_SOURCE_2.mode()[0]
Out[66]:
0.2858978721410488

Checking the average value of EXT_SOURCE_2

In [67]:
df.EXT_SOURCE_2.mean()
Out[67]:
0.5143926741308463

Observations

  • Looking at the boxplot, median is 0.565961
  • Most recurring value is 0.2858978721410488
  • Mean value is 0.5143926741308463
  • There is no outlier in the dataset
  • Mean and median are closer and can be used for imputation. It will not introduce bias as the missing value percentage is small (0.21%)

Analysis of AMT_GOODS_PRICE column¶

  • nullable values = 0.09%
In [68]:
df.AMT_GOODS_PRICE.value_counts().head()
Out[68]:
450000.00    26022
225000.00    25282
675000.00    24962
900000.00    15416
270000.00    11428
Name: AMT_GOODS_PRICE, dtype: int64
In [69]:
sns.boxplot(df.AMT_GOODS_PRICE)
plt.show()
No description has been provided for this image

Getting percentile values for AMT_GOODS_PRICE

In [70]:
df.AMT_GOODS_PRICE.quantile(q = [0.25,0.5,0.75,0.99,1])
Out[70]:
0.25    238500.00
0.50    450000.00
0.75    679500.00
0.99   1800000.00
1.00   4050000.00
Name: AMT_GOODS_PRICE, dtype: float64

Most recurring value in AMT_GOODS_PRICE

In [71]:
df.AMT_GOODS_PRICE.mode()[0]
Out[71]:
450000.0

Checking the average value of AMT_GOODS_PRICE

In [72]:
df.AMT_GOODS_PRICE.mean()
Out[72]:
538396.2074288895

Observations

  • Looking at the boxplot, median is 450000.0
  • Most recurring value is 450000.0. So, median and mode are the same
  • Mean value is 538396.2074288895
  • Though there are values above 2500000 they cannot be treated as outliers as it could be a valid goods price
  • Mean and median are exactly same and can be used for imputation. It will not introduce bias as the missing value percentage is small (0.09%)

Dealing with incorrect/unknown data values¶

Analysis of CODE_GENDER column¶

Checking range of values

In [73]:
df['CODE_GENDER'].value_counts()
Out[73]:
F      202448
M      105059
XNA         4
Name: CODE_GENDER, dtype: int64

Gender should only be Male or Female. XNA value may indicate that the value was not provided by the loan applicant or missed by the loan officer verifying the application

In [74]:
df[df['CODE_GENDER'] == 'XNA']
Out[74]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_2 EXT_SOURCE_3 OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
35657 141289 0 Revolving loans XNA Y Y 0 207000.00 382500.00 19125.00 337500.00 Unaccompanied Working Secondary / secondary special Married Municipal apartment 0.02 -20232 -10044 -10024.00 -3537 1 1 1 1 1 0 Unknown 2.00 3 3 TUESDAY 10 0 0 0 0 0 0 Business Entity Type 1 0.30 0.46 0.00 0.00 0.00 0.00 -286.00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 1.00
38566 144669 0 Revolving loans XNA N Y 2 157500.00 270000.00 13500.00 225000.00 Family Working Secondary / secondary special Married House / apartment 0.03 -13717 -2797 -2241.00 -4659 1 1 1 1 1 0 Low-skill Laborers 4.00 2 2 FRIDAY 16 0 0 0 0 0 0 Industry: type 3 0.71 0.31 0.00 0.00 0.00 0.00 -493.00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 3.00 0.00 4.00
83382 196708 0 Revolving loans XNA N Y 1 135000.00 405000.00 20250.00 225000.00 Unaccompanied Working Higher education Married House / apartment 0.04 -10647 -1228 -183.00 -1671 1 1 1 1 1 0 Core staff 3.00 2 2 THURSDAY 15 0 0 0 0 0 0 Kindergarten 0.66 0.08 7.00 1.00 7.00 1.00 -851.00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 3.00
189640 319880 0 Revolving loans XNA Y Y 0 247500.00 540000.00 27000.00 900000.00 Unaccompanied Commercial associate Incomplete higher Civil marriage House / apartment 0.04 -9649 -2293 -4099.00 -2326 1 1 1 1 1 0 Unknown 2.00 2 2 FRIDAY 15 0 0 0 0 0 0 Medicine 0.66 0.36 10.00 4.00 10.00 4.00 -1681.00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 1.00 6.00

As data looks valid, we will go ahead and check for an imputation method.

  • Female applicants are twice the number of Male applicants
  • And so, we will go ahead and impute CODE_GENDER with 'F'
In [75]:
df['CODE_GENDER'] = df['CODE_GENDER'].apply(lambda x: 'F' if x == 'XNA' else x)

Checking if XNA is removed

In [76]:
df['CODE_GENDER'].value_counts()
Out[76]:
F    202452
M    105059
Name: CODE_GENDER, dtype: int64

Analysis of DAYS_BIRTH column¶

In [77]:
df['DAYS_BIRTH'].value_counts().head()
Out[77]:
-13749    43
-13481    42
-10020    41
-18248    41
-15771    40
Name: DAYS_BIRTH, dtype: int64

There are ~17K+ unique records all of which seem to be having negative values

In [78]:
df['DAYS_BIRTH'].unique()
Out[78]:
array([ -9461, -16765, -19046, ...,  -7951,  -7857, -25061])
In [79]:
df['DAYS_BIRTH'].nunique()
Out[79]:
17460

Converting Days Birth to positive days

In [80]:
df['DAYS_BIRTH'] = df['DAYS_BIRTH'].apply(lambda x: -x if x < 0 else x)
In [81]:
df['DAYS_BIRTH'].value_counts()
Out[81]:
13749    43
13481    42
18248    41
10020    41
15771    40
         ..
24880     1
7934      1
16436     1
25177     1
24517     1
Name: DAYS_BIRTH, Length: 17460, dtype: int64

All Days in DAYS_BIRTH have positive values

Creating a new column YEARS_BIRTH for ease of analysis¶

In [82]:
df['YEARS_BIRTH'] = df['DAYS_BIRTH'].apply(lambda x: round(x/365))

Analysis of NAME_FAMILY_STATUS column¶

Checking range of values

In [83]:
df['NAME_FAMILY_STATUS'].value_counts()
Out[83]:
Married                 196432
Single / not married     45444
Civil marriage           29775
Separated                19770
Widow                    16088
Unknown                      2
Name: NAME_FAMILY_STATUS, dtype: int64

Gender should only be Male or Female. Unknown value may indicate that the value was not provided by the loan applicant or missed by the loan officer verifying the application

In [84]:
df[df['NAME_FAMILY_STATUS'] == 'Unknown']
Out[84]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_2 EXT_SOURCE_3 OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR YEARS_BIRTH
41982 148605 0 Revolving loans M N Y 0 450000.00 675000.00 33750.00 NaN NaN Commercial associate Lower secondary Unknown Municipal apartment 0.02 12396 -1161 -3265.00 -4489 1 1 1 1 1 0 Managers NaN 2 2 THURSDAY 15 0 1 1 0 1 1 Insurance 0.70 NaN 3.00 0.00 3.00 0.00 -876.00 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN 34
187348 317181 0 Revolving loans F N Y 0 202500.00 585000.00 29250.00 NaN NaN Commercial associate Higher education Unknown House / apartment 0.03 12844 -232 -1597.00 -1571 1 1 0 1 0 0 Accountants NaN 2 2 FRIDAY 14 0 0 0 0 0 0 Business Entity Type 2 0.65 0.67 1.00 0.00 1.00 0.00 -654.00 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 1.00 35
In [85]:
df['NAME_FAMILY_STATUS'].value_counts(normalize = True) * 100
Out[85]:
Married                63.88
Single / not married   14.78
Civil marriage          9.68
Separated               6.43
Widow                   5.23
Unknown                 0.00
Name: NAME_FAMILY_STATUS, dtype: float64

As data looks valid, we will go ahead and check for an imputation method.

  • Married applicants make up more than 63% of applicants.
  • Hence, we will go ahead and impute NAME_FAMILY_STATUS with 'Married'
In [86]:
df['NAME_FAMILY_STATUS'] = df['NAME_FAMILY_STATUS'].apply(lambda x: 'Married' if x == 'Unknown' else x)

Checking if Unknown is removed

In [87]:
df['NAME_FAMILY_STATUS'].value_counts()
Out[87]:
Married                 196434
Single / not married     45444
Civil marriage           29775
Separated                19770
Widow                    16088
Name: NAME_FAMILY_STATUS, dtype: int64

Analysis of DAYS_EMPLOYED column¶

In [88]:
df['DAYS_EMPLOYED'].value_counts().head()
Out[88]:
 365243    55374
-200         156
-224         152
-230         151
-199         151
Name: DAYS_EMPLOYED, dtype: int64
In [89]:
df['DAYS_EMPLOYED'].value_counts(normalize = True) * 100
Out[89]:
 365243   18.01
-200       0.05
-224       0.05
-230       0.05
-199       0.05
           ... 
-11060     0.00
-10409     0.00
-10155     0.00
-11948     0.00
-12341     0.00
Name: DAYS_EMPLOYED, Length: 12574, dtype: float64
In [90]:
len(df[df['DAYS_EMPLOYED'] < 365243])
Out[90]:
252137
In [91]:
df[df['DAYS_EMPLOYED'] < 365243].DAYS_EMPLOYED.value_counts()
Out[91]:
-200      156
-224      152
-199      151
-230      151
-212      150
         ... 
-12824      1
-8609       1
-8673       1
-6803       1
-9374       1
Name: DAYS_EMPLOYED, Length: 12573, dtype: int64
In [92]:
df['DAYS_EMPLOYED'].unique()
Out[92]:
array([  -637,  -1188,   -225, ..., -12971, -11084,  -8694])
In [93]:
df['DAYS_EMPLOYED'].nunique()
Out[93]:
12574

Observations

  • There are ~55K+ records for which DAYS_EMPLOYED is 365243 days
  • Remaining 252K+ records have negative value for days
  • There are 12,574 unique values for DAYS_EMPLOYED
  • DAYS_EMPLOYED column indicates how many days before the application the person started current employment, the applicant/loan officer must have entered negative values to indicate the days before.
  • We will convert negative values in DAYS_EMPLOYED to positive days to standardize the days during use in calculations
In [94]:
df['DAYS_EMPLOYED'] = df['DAYS_EMPLOYED'].apply(lambda x: -x if x < 0 else x)
In [95]:
df['DAYS_EMPLOYED'].value_counts().head()
Out[95]:
365243    55374
200         156
224         152
199         151
230         151
Name: DAYS_EMPLOYED, dtype: int64

We can see that all days in DAYS_EMPLOYED have positive values

For ~55K+ records for which DAYS_EMPLOYED is 365243 days

  • Converting this to year gives us 1000 years which is physically impossible to be employed for an applicant
  • This is present for 18% of the data and cannot be an anamoly
  • They could either be Pensioners or Unemployed and looking at the data our conclusion is correct

There are two ways to handle this

  1. We let the data remain as it is and account for this during analysis OR
  2. We calculate Average Days Employed excluding this category and impute that instead of 365243 days for Pensioners.
    For Unemployed, Days Employed can be 0

Note

  • During calculations with this column, we need to account for this scenario as otherwise it will skew our results
In [96]:
df[df['DAYS_EMPLOYED'] == 365243].NAME_INCOME_TYPE.value_counts()
Out[96]:
Pensioner     55352
Unemployed       22
Name: NAME_INCOME_TYPE, dtype: int64

Creating a new column YEARS_EMPLOYED for ease of analysis¶

In [97]:
df['YEARS_EMPLOYED'] = df['DAYS_EMPLOYED'].apply(lambda x: round(x/365))

Analysis of DAYS_REGISTRATION column¶

In [98]:
df['DAYS_REGISTRATION'].value_counts().head()
Out[98]:
-1.00    113
-7.00     98
-6.00     96
-4.00     92
-2.00     92
Name: DAYS_REGISTRATION, dtype: int64
In [99]:
df['DAYS_REGISTRATION'].value_counts(normalize = True).head()
Out[99]:
-1.00   0.00
-7.00   0.00
-6.00   0.00
-4.00   0.00
-2.00   0.00
Name: DAYS_REGISTRATION, dtype: float64
In [100]:
df['DAYS_REGISTRATION'].unique()
Out[100]:
array([ -3648.,  -1186.,  -4260., ..., -16396., -14558., -14798.])
In [101]:
df['DAYS_REGISTRATION'].nunique()
Out[101]:
15688

Converting DAYS_REGISTRATION to positive days

In [102]:
df['DAYS_REGISTRATION'] = df['DAYS_REGISTRATION'].apply(lambda x: -x if x < 0 else x)
In [103]:
df['DAYS_REGISTRATION'].value_counts().head()
Out[103]:
1.00    113
7.00     98
6.00     96
2.00     92
4.00     92
Name: DAYS_REGISTRATION, dtype: int64

All Days in DAYS_REGISTRATION have positive values

Creating a new column YEARS_REGISTRATION for ease of analysis¶

In [104]:
df['YEARS_REGISTRATION'] = df['DAYS_REGISTRATION'].apply(lambda x: round(x/365))

Analysis of DAYS_ID_PUBLISH column¶

In [105]:
df['DAYS_ID_PUBLISH'].value_counts().head()
Out[105]:
-4053    169
-4095    162
-4046    161
-4417    159
-4256    158
Name: DAYS_ID_PUBLISH, dtype: int64
In [106]:
df['DAYS_ID_PUBLISH'].value_counts(normalize = True).head()
Out[106]:
-4053   0.00
-4095   0.00
-4046   0.00
-4417   0.00
-4256   0.00
Name: DAYS_ID_PUBLISH, dtype: float64
In [107]:
df['DAYS_ID_PUBLISH'].unique()
Out[107]:
array([-2120,  -291, -2531, ..., -6194, -5854, -6211])
In [108]:
df['DAYS_ID_PUBLISH'].nunique()
Out[108]:
6168

Converting DAYS_ID_PUBLISH to positive days

In [109]:
df['DAYS_ID_PUBLISH'] = df['DAYS_ID_PUBLISH'].apply(lambda x: -x if x < 0 else x)
In [110]:
df['DAYS_ID_PUBLISH'].value_counts().head()
Out[110]:
4053    169
4095    162
4046    161
4417    159
4256    158
Name: DAYS_ID_PUBLISH, dtype: int64

All Days in DAYS_ID_PUBLISH have positive values

Creating a new column YEARS_ID_PUBLISH for ease of analysis¶

In [111]:
df['YEARS_ID_PUBLISH'] = df['DAYS_ID_PUBLISH'].apply(lambda x: round(x/365))

Analysis of DAYS_LAST_PHONE_CHANGE column¶

In [112]:
df['DAYS_LAST_PHONE_CHANGE'].value_counts().head()
Out[112]:
0.00     37672
-1.00     2812
-2.00     2318
-3.00     1763
-4.00     1285
Name: DAYS_LAST_PHONE_CHANGE, dtype: int64
In [113]:
df['DAYS_LAST_PHONE_CHANGE'].value_counts(normalize = True).head()
Out[113]:
0.00    0.12
-1.00   0.01
-2.00   0.01
-3.00   0.01
-4.00   0.00
Name: DAYS_LAST_PHONE_CHANGE, dtype: float64
In [114]:
df['DAYS_LAST_PHONE_CHANGE'].unique()
Out[114]:
array([-1134.,  -828.,  -815., ..., -3988., -3899., -3538.])
In [115]:
df['DAYS_LAST_PHONE_CHANGE'].nunique()
Out[115]:
3773

Converting DAYS_LAST_PHONE_CHANGE to positive days

In [116]:
df['DAYS_LAST_PHONE_CHANGE'] = df['DAYS_LAST_PHONE_CHANGE'].apply(lambda x: -x if x < 0 else x)
In [117]:
df['DAYS_LAST_PHONE_CHANGE'].value_counts().head()
Out[117]:
0.00    37672
1.00     2812
2.00     2318
3.00     1763
4.00     1285
Name: DAYS_LAST_PHONE_CHANGE, dtype: int64

All Days in DAYS_LAST_PHONE_CHANGE have positive values

Creating a new column YEARS_LAST_PHONE_CHANGE for ease of analysis¶

In [118]:
df['YEARS_LAST_PHONE_CHANGE'] = df['DAYS_LAST_PHONE_CHANGE'].apply(lambda x: round(x/365,0))

Automation functions¶

Outlier Analysis (Distplot + BoxPlot) - numerical columns¶

In [119]:
def fn_dist_box(dataset,column):
    plt.subplots(1,2 ,figsize = (20,8))

    plt.subplot(121)
    sns.distplot(dataset[column], color = 'purple')
    pltname = 'Distplot of ' + column
    plt.ticklabel_format(style='plain', axis='x')
    plt.title(pltname)

    plt.subplot(122)
#     sns.boxplot(y = column, data = dataset, fliersize = 10)
    red_diamond = dict(markerfacecolor='r', marker='D')
    sns.boxplot(y = column, data = dataset, flierprops = red_diamond)
#     plt.boxplot(dataset[column], flierprops = red_diamond)
    pltname = 'Boxplot of ' + column
    plt.title(pltname)

    plt.tight_layout(pad = 4)
    plt.show()

Creating a function age_cat to categorize YEARS_BIRTH¶

In [120]:
def age_cat(years):
    if years <= 20:
        return '0-20'
    elif years > 20 and years <= 30:
        return '20-30'
    elif years > 30 and years <= 40:
        return '30-40'  
    elif years > 40 and years <= 50:
        return '40-50'
    elif years > 50 and years <= 60:
        return '50-60'
    elif years > 60 and years <= 70:
        return '60-70'
    elif years > 70:
        return '70+'

Univariate Analysis (Countplot) - categorical columns¶

In [121]:
def fn_uni_countplot(column):
    plt.figure(figsize = [20,8])
    palt = sns.color_palette("bright")

    plt.subplot(1,2,1)
    pltname = column + ' of clients with payment difficulties'
    plt.title(pltname)
    sns.countplot(x = column, data = df1, order = sorted(df1[column].unique(), reverse = True), palette = palt)
    plt.xticks(rotation = 90)
    
    plt.subplot(1,2,2)
    pltname = column + ' of clients with on-time payments'
    plt.title(pltname)
    sns.countplot(x = column, data = df0, order = sorted(df0[column].unique(), reverse = True), palette = palt)
    plt.xticks(rotation = 90)

    plt.tight_layout(pad = 4)
    plt.show()

Univariate Analysis (Piechart) - categorical columns¶

In [122]:
def fn_uni_piechart(column):
    plt.figure(figsize = [20,12])

    plt.subplot(1,2,1)
    pltname = column + ' of clients with payment difficulties'
    plt.title(pltname)
    df1[column].value_counts().plot.pie(autopct='%1.1f%%',shadow=True, startangle=60, colors = ['green','yellow','purple','orange','red'], labeldistance=None)
    plt.legend()
    
    plt.subplot(1,2,2)
    pltname = column + ' of clients with on-time payments'
    plt.title(pltname)
    df0[column].value_counts().plot.pie(autopct='%1.1f%%',shadow=True, startangle=60, colors = ['green','yellow','purple','orange','red'], labeldistance=None)
    
    plt.legend()
    plt.tight_layout(pad = 4)
    plt.show()

Univariate Analysis (Barplot) - categorical columns¶

In [123]:
def fn_uni_barplot(column):
    plt.figure(figsize = [20,8])

    plt.subplot(1,2,1)
    (df1[column].value_counts(normalize=True)*100).plot.bar(title = column + " Payment difficulties", color=['black', 'red', 'green', 'blue', 'cyan'])
    plt.xticks(rotation=90)

    plt.subplot(1,2,2)
    (df0[column].value_counts(normalize=True)*100).plot.bar(title = column + " On-Time Payments", color=['black', 'red', 'green', 'blue', 'cyan'])
    plt.xticks(rotation=90)

    plt.tight_layout(pad = 4)
    plt.show()

Calculating min and max outlier range for numerical columns¶

In [124]:
def outlier_range(dataset,column):
    Q1 = dataset[column].quantile(0.25)
    Q3 = dataset[column].quantile(0.75)
    IQR = Q3 - Q1
    Min_value = (Q1 - 1.5 * IQR)
    Max_value = (Q3 + 1.5 * IQR)
    return Max_value

Bivariate Analysis (boxplot) - categorical V/S continuous variables¶

In [125]:
 def fn_bi_boxplot(categorical,continuous,max_continuous1,max_continuous0,hue_column):
    plt.figure(figsize = [20,12])

    plt.subplot(1,2,1)
    plt.title('Payment Difficulties')
    red_diamond = dict(markerfacecolor='r', marker='D')
    sns.boxplot(x = categorical, 
                y = df1[df1[continuous] < max_continuous1][continuous], 
                data = df1, 
                flierprops = red_diamond, 
                order = sorted(df1[categorical].unique(), reverse = True),
                hue = hue_column, hue_order = sorted(df0[hue_column].unique(), reverse = True))
    plt.ticklabel_format(style='plain', axis='y')
    plt.xticks(rotation=90)

    plt.subplot(1,2,2)
    plt.title('On-Time Payments')
    sns.boxplot(x = categorical, 
                y = df0[df0[continuous] < max_continuous0][continuous], 
                data = df0, 
                flierprops = red_diamond, 
                order = sorted(df0[categorical].unique(), reverse = True),
                hue = hue_column, hue_order = sorted(df0[hue_column].unique(), reverse = True))
    plt.ticklabel_format(style='plain', axis='y')
    plt.xticks(rotation=90)

    plt.tight_layout(pad = 4)
    plt.show()

Bivariate Analysis (Countplot) - categorical V/S categorical columns¶

In [126]:
def fn_bi_countplot(column,hue_column):
    plt.figure(figsize = [20,8])
    palt = sns.color_palette("bright")

    plt.subplot(1,2,1)
    pltname = 'Clients with payment difficulties'
    plt.title(pltname)
    sns.countplot(x = column, data = df1, 
                  order = sorted(df1[column].unique(), reverse = True), palette = palt,
                  hue = hue_column, hue_order = sorted(df1[hue_column].unique(), reverse = True))
    plt.xticks(rotation = 90)
    
    plt.subplot(1,2,2)
    pltname = 'Clients with on-time payments'
    plt.title(pltname)
    sns.countplot(x = column, data = df0, 
                  order = sorted(df0[column].unique(), reverse = True), palette = palt,
                  hue = hue_column, hue_order = sorted(df0[hue_column].unique(), reverse = True))
    plt.xticks(rotation = 90)

    plt.tight_layout(pad = 4)
    plt.show()

Univariate Analysis (Countplot) - categorical columns - Merged dataset¶

In [127]:
def fn_uni_countplot_merge(column):
    plt.figure(figsize = [20,8])
    palt = sns.color_palette("bright")
    
    pltname = 'Analysis of ' + column
    plt.title(pltname)
    sns.countplot(x = column, data = df_merge, order = sorted(df_merge[column].unique(), reverse = True), palette = palt)
    plt.xticks(rotation = 90)

    plt.tight_layout(pad = 4)
    plt.show()

Univariate Analysis (Piechart) - categorical columns - Merged dataset¶

In [128]:
def fn_uni_piechart_merge(column):
    plt.figure(figsize = [10,6])

    pltname = 'Analysis of ' + column
    plt.title(pltname)
    df_merge[column].value_counts().plot.pie(autopct='%1.1f%%',shadow=True, startangle=60, colors = ['purple','orange','red','green','yellow','pink'], labeldistance=None)
    
    plt.legend()
    plt.tight_layout(pad = 4)
    plt.show()

Bivariate Analysis (boxplot) - categorical V/S continuous variables - Merged dataset¶

In [129]:
 def fn_bi_boxplot_merge(categorical,continuous,max_continuous,hue_column):
    plt.figure(figsize = [20,12])
    red_diamond = dict(markerfacecolor='r', marker='D')

    sns.boxplot(x = categorical, 
                y = df_merge[df_merge[continuous] < max_continuous][continuous], 
                data = df_merge, 
                flierprops = red_diamond, 
                order = sorted(df_merge[categorical].unique(), reverse = True),
                hue = hue_column, hue_order = sorted(df_merge[hue_column].unique(), reverse = True))
    plt.ticklabel_format(style='plain', axis='y')
    plt.xticks(rotation=90)

    plt.tight_layout(pad = 4)
    plt.show()

Bivariate Analysis (Countplot) - categorical V/S categorical columns - Merged dataset¶

In [130]:
def fn_bi_countplot_merge(column,hue_column):
    plt.figure(figsize = [20,8])
    palt = sns.color_palette("bright")

    sns.countplot(x = column, data = df_merge, 
                  order = sorted(df_merge[column].unique()), palette = palt,
                  hue = hue_column, hue_order = sorted(df_merge[hue_column].unique(), reverse = True))
    plt.xticks(rotation = 90)

    plt.tight_layout(pad = 4)
    plt.show()

Dealing with outliers for numerical columns¶

Analysis of CNT_CHILDREN column¶

In [131]:
df['CNT_CHILDREN'].value_counts().sort_values(ascending = False).head()
Out[131]:
0    215371
1     61119
2     26749
3      3717
4       429
Name: CNT_CHILDREN, dtype: int64
In [132]:
(df['CNT_CHILDREN'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()
Out[132]:
0   70.04
1   19.88
2    8.70
3    1.21
4    0.14
Name: CNT_CHILDREN, dtype: float64
In [133]:
fn_dist_box(df,'CNT_CHILDREN')
No description has been provided for this image

Calculating IQR (Inter Quartile range)

In [134]:
Q1 = df['CNT_CHILDREN'].quantile(0.25)
Q3 = df['CNT_CHILDREN'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
1.0

Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers

In [135]:
Min_value = (Q1 - 1.5 * IQR)
Max_value = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value))
print("Max value after which outlier exist: {}".format(Max_value))
Min value before which outlier exist: -1.5
Max value after which outlier exist: 2.5

Observations

  • Looking at the data, we can see that above 7 children, the count of applicants are very minimal (2 or 3 in each category)
  • Also, looking at the observation data for applicants with 10 children, the applicants are only 31 and 41 years respectively. This seems a one-off scenario and can be treated as an outlier
  • Both distplots and boxplots clearly show the values above 2.5 as being outliers

Conclusion

  • Applicants with 3 or more children are outlier cases

Analysis of AMT_INCOME_TOTAL column¶

In [136]:
df['AMT_INCOME_TOTAL'].value_counts().sort_values(ascending = False).head()
Out[136]:
135000.00    35750
112500.00    31019
157500.00    26556
180000.00    24719
90000.00     22483
Name: AMT_INCOME_TOTAL, dtype: int64
In [137]:
(df['AMT_INCOME_TOTAL'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()
Out[137]:
135000.00   11.63
112500.00   10.09
157500.00    8.64
180000.00    8.04
90000.00     7.31
Name: AMT_INCOME_TOTAL, dtype: float64
In [138]:
df['AMT_INCOME_TOTAL'].describe(percentiles = [0.75,0.99,0.999])
Out[138]:
count      307511.00
mean       168797.92
std        237123.15
min         25650.00
50%        147150.00
75%        202500.00
99%        472500.00
99.9%      900000.00
max     117000000.00
Name: AMT_INCOME_TOTAL, dtype: float64

Plotting for AMT_INCOME_TOTAL

In [139]:
fn_dist_box(df,'AMT_INCOME_TOTAL')
No description has been provided for this image
  • The end result charts are very thin and we are able to observe an outlier around ~120 million.
  • Let's plot it out by only considering income under 99.9% value which is 900K
In [140]:
plt.subplots(1,2 ,figsize = (20,8))

plt.subplot(121)
sns.distplot(df[df['AMT_INCOME_TOTAL'] < 900000].AMT_INCOME_TOTAL)
pltname = 'Distplot of ' + 'AMT_INCOME_TOTAL'
plt.title(pltname)

plt.subplot(122)
sns.boxplot(df[df['AMT_INCOME_TOTAL'] < 900000].AMT_INCOME_TOTAL)
pltname = 'Boxplot of ' + 'AMT_INCOME_TOTAL'
plt.title(pltname)

plt.tight_layout(pad = 4)
plt.show()
No description has been provided for this image

Now, we are able to clearly make out the distribution and data range in both plots.

  • This means that values above 900K income are clearly outliers
In [141]:
df[df['AMT_INCOME_TOTAL'] > 900000].head()
Out[141]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_2 EXT_SOURCE_3 OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR YEARS_BIRTH YEARS_EMPLOYED YEARS_REGISTRATION YEARS_ID_PUBLISH YEARS_LAST_PHONE_CHANGE
1504 101769 0 Revolving loans M Y Y 0 1080000.00 180000.00 9000.00 180000.00 Unaccompanied Commercial associate Higher education Married House / apartment 0.07 13633 5536 3.00 4971 1 1 0 1 0 0 Unknown 2.00 1 1 FRIDAY 13 0 1 1 0 0 0 Business Entity Type 3 0.65 0.46 0.00 0.00 0.00 0.00 491.00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00 37 15 0 14 1.00
1723 102015 0 Cash loans F N Y 0 1935000.00 269550.00 10534.50 225000.00 Unaccompanied Pensioner Secondary / secondary special Married House / apartment 0.01 18636 365243 3401.00 2168 1 0 0 1 1 0 Unknown 2.00 2 2 MONDAY 11 0 0 0 0 0 0 XNA 0.27 0.33 1.00 0.00 1.00 0.00 381.00 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00 51 1001 9 6 1.00
3371 103938 0 Cash loans F N N 1 1350000.00 2410380.00 109053.00 2250000.00 NaN Commercial associate Higher education Married House / apartment 0.07 12213 843 757.00 4132 1 1 0 1 1 0 Managers 3.00 1 1 SUNDAY 13 0 1 1 0 0 0 Housing 0.72 0.57 0.00 0.00 0.00 0.00 2.00 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0.00 0.00 0.00 3.00 0.00 0.00 33 2 2 11 0.00
4603 105384 0 Revolving loans F Y Y 0 1350000.00 405000.00 20250.00 405000.00 Unaccompanied Commercial associate Higher education Married House / apartment 0.02 14781 4768 3469.00 4045 1 1 0 1 0 0 Managers 2.00 2 2 TUESDAY 12 0 0 0 0 0 0 Business Entity Type 3 0.66 0.73 0.00 0.00 0.00 0.00 0.00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00 40 13 10 11 0.00
5673 106637 0 Cash loans M Y Y 3 967500.00 450000.00 30073.50 450000.00 Unaccompanied Commercial associate Higher education Married House / apartment 0.05 11901 546 2932.00 215 1 1 0 1 0 0 Managers 5.00 1 1 MONDAY 14 0 0 0 1 1 0 Business Entity Type 3 0.49 0.45 0.00 0.00 0.00 0.00 88.00 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 1.00 1.00 0.00 33 1 8 1 0.00

Observations

  • Looking at the data, we can see that Income above 900K (99.9% value) are outliers
  • Both distplots and boxplots clearly show the same trend

Conclusion

  • Applicants with Income above 900K (99.9% value) are outliers

Analysis of CNT_FAM_MEMBERS column¶

In [142]:
df['CNT_FAM_MEMBERS'].value_counts().sort_values(ascending = False).head()
Out[142]:
2.00    158357
1.00     67847
3.00     52601
4.00     24697
5.00      3478
Name: CNT_FAM_MEMBERS, dtype: int64
In [143]:
(df['CNT_FAM_MEMBERS'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()
Out[143]:
2.00   51.50
1.00   22.06
3.00   17.11
4.00    8.03
5.00    1.13
Name: CNT_FAM_MEMBERS, dtype: float64
In [144]:
df['CNT_FAM_MEMBERS'].describe(percentiles = [0.75,0.99,0.9999])
Out[144]:
count    307509.00
mean          2.15
std           0.91
min           1.00
50%           2.00
75%           3.00
99%           5.00
99.99%        8.00
max          20.00
Name: CNT_FAM_MEMBERS, dtype: float64

Plotting for CNT_FAM_MEMBERS

In [145]:
fn_dist_box(df,'CNT_FAM_MEMBERS')
No description has been provided for this image
  • For family members count from 5 upwards, the results are sparse and there is a clear outlier around 20
  • Applicants with family members above 10 are only 2 or 3
  • Let's plot it out by only considering family members under 99.9% value which is 8
In [146]:
plt.subplots(1,2 ,figsize = (20,8))

plt.subplot(121)
sns.distplot(df[df['CNT_FAM_MEMBERS'] <= 8].CNT_FAM_MEMBERS)
pltname = 'Distplot of ' + 'CNT_FAM_MEMBERS'
plt.title(pltname)

plt.subplot(122)
sns.boxplot(df[df['CNT_FAM_MEMBERS'] <= 8].CNT_FAM_MEMBERS)
pltname = 'Boxplot of ' + 'CNT_FAM_MEMBERS'
plt.title(pltname)

plt.tight_layout(pad = 4)
plt.show()
No description has been provided for this image

Now, we are able to clearly make out the distribution and data range in both plots.

  • This means that applicants with 5 or more family members are clearly outliers

Analysis of AMT_ANNUITY column¶

In [147]:
df['AMT_ANNUITY'].value_counts().sort_values(ascending = False).head()
Out[147]:
9000.00     6385
13500.00    5514
6750.00     2279
10125.00    2035
37800.00    1602
Name: AMT_ANNUITY, dtype: int64
In [148]:
(df['AMT_ANNUITY'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()
Out[148]:
9000.00    2.08
13500.00   1.79
6750.00    0.74
10125.00   0.66
37800.00   0.52
Name: AMT_ANNUITY, dtype: float64
In [149]:
df['AMT_ANNUITY'].describe(percentiles = [0.75,0.99,0.9999])
Out[149]:
count    307499.00
mean      27108.57
std       14493.74
min        1615.50
50%       24903.00
75%       34596.00
99%       70006.50
99.99%   213291.00
max      258025.50
Name: AMT_ANNUITY, dtype: float64

Calculating IQR (Inter Quartile range)

In [150]:
Q1 = df['AMT_ANNUITY'].quantile(0.25)
Q3 = df['AMT_ANNUITY'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
18072.0

Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers

In [151]:
Min_value = (Q1 - 1.5 * IQR)
Max_value = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value))
print("Max value after which outlier exist: {}".format(Max_value))
Min value before which outlier exist: -10584.0
Max value after which outlier exist: 61704.0

Plotting for AMT_ANNUITY

In [152]:
fn_dist_box(df,'AMT_ANNUITY')
No description has been provided for this image
  • As observed from displot and boxplot, the outliers tend to exist after 61704.0 (This value is derived using Max_value using IQR formulae)
  • We can verify the same by reproducing the same plots under outlier value as shown below
In [153]:
plt.subplots(1,2 ,figsize = (20,8))

plt.subplot(121)
sns.distplot(df[df['AMT_ANNUITY'] <= 61704.0].AMT_ANNUITY)
pltname = 'Distplot of ' + 'AMT_ANNUITY'
plt.title(pltname)

plt.subplot(122)
sns.boxplot(df[df['AMT_ANNUITY'] <= 61704.0].AMT_ANNUITY)
pltname = 'Boxplot of ' + 'AMT_ANNUITY'
plt.title(pltname)

plt.tight_layout(pad = 4)
plt.show()
No description has been provided for this image

Now, we are able to clearly make out the distribution and data range in both plots

  • This means that values above 61704.0 AMT_ANNUITY are clearly outliers

Observations

  • As observed from displot and boxplot, the outliers tend to exist after 61704 (Outlier value is derived using Max_value using IQR formulae)

Conclusion

  • Applicants with AMT_ANNUITY above 61704 (calculated using IQR) are outliers

Analysis of AMT_CREDIT column¶

In [154]:
df['AMT_CREDIT'].value_counts().sort_values(ascending = False).head()
Out[154]:
450000.00    9709
675000.00    8877
225000.00    8162
180000.00    7342
270000.00    7241
Name: AMT_CREDIT, dtype: int64
In [155]:
(df['AMT_CREDIT'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()
Out[155]:
450000.00   3.16
675000.00   2.89
225000.00   2.65
180000.00   2.39
270000.00   2.35
Name: AMT_CREDIT, dtype: float64
In [156]:
df['AMT_CREDIT'].describe(percentiles = [0.75,0.99,0.9999])
Out[156]:
count     307511.00
mean      599026.00
std       402490.78
min        45000.00
50%       513531.00
75%       808650.00
99%      1854000.00
99.99%   3030530.76
max      4050000.00
Name: AMT_CREDIT, dtype: float64

Calculating IQR (Inter Quartile range)

In [157]:
Q1 = df['AMT_CREDIT'].quantile(0.25)
Q3 = df['AMT_CREDIT'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
538650.0

Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers

In [158]:
Min_value = (Q1 - 1.5 * IQR)
Max_value = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value))
print("Max value after which outlier exist: {}".format(Max_value))
Min value before which outlier exist: -537975.0
Max value after which outlier exist: 1616625.0

Plotting for AMT_CREDIT

In [159]:
fn_dist_box(df,'AMT_CREDIT')
No description has been provided for this image
  • As observed from displot and boxplot, the outliers tend to exist after 1616625.0 (This value is derived using Max_value using IQR formulae)
  • We can verify the same by reproducing the same plots under outlier value as shown below
In [160]:
plt.subplots(1,2 ,figsize = (20,8))

plt.subplot(121)
sns.distplot(df[df['AMT_CREDIT'] <= 1616625.0].AMT_CREDIT)
pltname = 'Distplot of ' + 'AMT_CREDIT'
plt.title(pltname)

plt.subplot(122)
sns.boxplot(df[df['AMT_CREDIT'] <= 1616625.0].AMT_CREDIT)
pltname = 'Boxplot of ' + 'AMT_CREDIT'
plt.title(pltname)

plt.tight_layout(pad = 4)
plt.show()
No description has been provided for this image

Now, we are able to clearly make out the distribution and data range in both plots

  • This means that values above 61616625.0 AMT_CREDIT are clearly outliers

Observations

  • As observed from displot and boxplot, the outliers tend to exist after 1616625.0 (Outlier value is derived using Max_value using IQR formulae)

Conclusion

  • Applicants with AMT_CREDIT above 1616625.0 (calculated using IQR) are outliers

Analysis of DAYS_LAST_PHONE_CHANGE column¶

In [161]:
df['DAYS_LAST_PHONE_CHANGE'].value_counts().sort_values(ascending = False).head()
Out[161]:
0.00    37672
1.00     2812
2.00     2318
3.00     1763
4.00     1285
Name: DAYS_LAST_PHONE_CHANGE, dtype: int64
In [162]:
(df['DAYS_LAST_PHONE_CHANGE'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()
Out[162]:
0.00   12.25
1.00    0.91
2.00    0.75
3.00    0.57
4.00    0.42
Name: DAYS_LAST_PHONE_CHANGE, dtype: float64
In [163]:
df['DAYS_LAST_PHONE_CHANGE'].describe(percentiles = [0.75,0.99,0.9999])
Out[163]:
count    307510.00
mean        962.86
std         826.81
min           0.00
50%         757.00
75%        1570.00
99%        3149.00
99.99%     3999.75
max        4292.00
Name: DAYS_LAST_PHONE_CHANGE, dtype: float64

Calculating IQR (Inter Quartile range)

In [164]:
Q1 = df['DAYS_LAST_PHONE_CHANGE'].quantile(0.25)
Q3 = df['DAYS_LAST_PHONE_CHANGE'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
1296.0

Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers

In [165]:
Min_value = (Q1 - 1.5 * IQR)
Max_value = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value))
print("Max value after which outlier exist: {}".format(Max_value))
Min value before which outlier exist: -1670.0
Max value after which outlier exist: 3514.0

Plotting for DAYS_LAST_PHONE_CHANGE

In [166]:
fn_dist_box(df,'DAYS_LAST_PHONE_CHANGE')
No description has been provided for this image
  • As observed from displot and boxplot, the outliers tend to exist after 3514.0 (This value is derived using Max_value using IQR formulae)
  • We can verify the same by reproducing the same plots under outlier value as shown below
In [167]:
plt.subplots(1,2 ,figsize = (20,8))

plt.subplot(121)
sns.distplot(df[df['DAYS_LAST_PHONE_CHANGE'] <= 3514.0].DAYS_LAST_PHONE_CHANGE)
pltname = 'Distplot of ' + 'DAYS_LAST_PHONE_CHANGE'
plt.title(pltname)

plt.subplot(122)
sns.boxplot(df[df['DAYS_LAST_PHONE_CHANGE'] <= 3514.0].DAYS_LAST_PHONE_CHANGE)
pltname = 'Boxplot of ' + 'DAYS_LAST_PHONE_CHANGE'
plt.title(pltname)

plt.tight_layout(pad = 4)
plt.show()
No description has been provided for this image

Now, we are able to clearly make out the distribution and data range in both plots

  • This means that values above 3514.0 DAYS_LAST_PHONE_CHANGE are clearly outliers

Observations

  • As observed from displot and boxplot, the outliers tend to exist after 3514.0 (Outlier value is derived using Max_value using IQR formulae)

Conclusion

  • Applicants with DAYS_LAST_PHONE_CHANGE above 3514.0 (calculated using IQR) are outliers

Binning of continuous columns for analysis¶

Categorizing AMT_GOODS_PRICE column¶

In [168]:
df['AMT_GOODS_PRICE'].value_counts().sort_values(ascending = False).head()
Out[168]:
450000.00    26022
225000.00    25282
675000.00    24962
900000.00    15416
270000.00    11428
Name: AMT_GOODS_PRICE, dtype: int64
In [169]:
(df['AMT_GOODS_PRICE'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()
Out[169]:
450000.00   8.47
225000.00   8.23
675000.00   8.12
900000.00   5.02
270000.00   3.72
Name: AMT_GOODS_PRICE, dtype: float64

Getting Statistical summary for AMT_GOODS_PRICE

In [170]:
df['AMT_GOODS_PRICE'].describe(percentiles = [0.25,0.75,0.99,0.9999])
Out[170]:
count     307233.00
mean      538396.21
std       369446.46
min        40500.00
25%       238500.00
50%       450000.00
75%       679500.00
99%      1800000.00
99.99%   2934964.80
max      4050000.00
Name: AMT_GOODS_PRICE, dtype: float64

Categorize the values in AMT_GOODS_PRICE into 5 bins and create a new column AMT_GOODS_PRICE_CATEGORY

In [171]:
df['AMT_GOODS_PRICE_CATEGORY'] = pd.cut(df['AMT_GOODS_PRICE'], bins = 5, labels = ['very low','low','medium','high','very high'])

Checking that the values are populated as per expectation

In [172]:
df['AMT_GOODS_PRICE_CATEGORY'].value_counts()
Out[172]:
very low     246973
low           56087
medium         4108
high             47
very high        18
Name: AMT_GOODS_PRICE_CATEGORY, dtype: int64

Categorizing YEARS_BIRTH column¶

We will be categorizing 'YEARS_BIRTH' instead of 'DAYS_BIRTH' as years are easier to interpret than days

In [173]:
df['YEARS_BIRTH'].value_counts().sort_values(ascending = False).head()
Out[173]:
39    9023
38    8793
37    8727
40    8602
41    8495
Name: YEARS_BIRTH, dtype: int64
In [174]:
(df['YEARS_BIRTH'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()
Out[174]:
39   2.93
38   2.86
37   2.84
40   2.80
41   2.76
Name: YEARS_BIRTH, dtype: float64

Getting Statistical summary for YEARS_BIRTH

In [175]:
df['YEARS_BIRTH'].describe(percentiles = [0.25,0.75,0.99,0.9999])
Out[175]:
count    307511.00
mean         43.94
std          11.96
min          21.00
25%          34.00
50%          43.00
75%          54.00
99%          67.00
99.99%       69.00
max          69.00
Name: YEARS_BIRTH, dtype: float64

Categorize the values from YEARS_BIRTH into a new column YEARS_BIRTH_CATEGORY

In [176]:
df['YEARS_BIRTH_CATEGORY'] = df['YEARS_BIRTH'].apply(age_cat)

Checking that the values are populated as per expectation

In [177]:
df['YEARS_BIRTH_CATEGORY'].value_counts().sort_values(ascending = False)
Out[177]:
30-40    82770
40-50    75509
50-60    67955
20-30    48869
60-70    32408
Name: YEARS_BIRTH_CATEGORY, dtype: int64

Categorizing YEARS_REGISTRATION column¶

We will be categorizing 'YEARS_REGISTRATION' instead of 'DAYS_REGISTRATION' as years are easier to interpret than days

In [178]:
df['YEARS_REGISTRATION'].value_counts().sort_values(ascending = False).head()
Out[178]:
1     15661
2     15564
3     13894
13    12866
12    12697
Name: YEARS_REGISTRATION, dtype: int64
In [179]:
(df['YEARS_REGISTRATION'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()
Out[179]:
1    5.09
2    5.06
3    4.52
13   4.18
12   4.13
Name: YEARS_REGISTRATION, dtype: float64

Getting Statistical summary for YEARS_REGISTRATION

In [180]:
df['YEARS_REGISTRATION'].describe(percentiles = [0.25,0.75,0.99,0.9999])
Out[180]:
count    307511.00
mean         13.66
std           9.66
min           0.00
25%           6.00
50%          12.00
75%          20.00
99%          38.00
99.99%       54.25
max          68.00
Name: YEARS_REGISTRATION, dtype: float64

Categorize the values from YEARS_REGISTRATION into a new column YEARS_REGISTRATION_CATEGORY

In [181]:
df['YEARS_REGISTRATION_CATEGORY'] = df['YEARS_REGISTRATION'].apply(age_cat)

Checking that the values are populated as per expectation

In [182]:
df['YEARS_REGISTRATION_CATEGORY'].value_counts().sort_values(ascending = False)
Out[182]:
0-20     230697
20-30     59288
30-40     16051
40-50      1406
50-60        60
60-70         9
Name: YEARS_REGISTRATION_CATEGORY, dtype: int64

Checking Imbalance for target column `TARGET'¶

Analyzing TARGET column¶

In [183]:
df['TARGET'].value_counts().sort_values(ascending = False)
Out[183]:
0    282686
1     24825
Name: TARGET, dtype: int64
In [184]:
df['TARGET'].value_counts(normalize = True).sort_values(ascending = False) * 100
Out[184]:
0   91.93
1    8.07
Name: TARGET, dtype: float64
In [185]:
plt.figure(figsize = (10,5))
sns.countplot(x = df['TARGET'], data = df)
plt.title('Checking imbalance ratio of TARGET variable')
plt.show()
No description has been provided for this image
In [186]:
plt.figure(figsize = [20,8])
(df['TARGET'].value_counts(normalize=True)*100).plot.bar(title = "Checking imbalance percentage for TARGET variable", color=['black', 'red', 'green', 'blue', 'cyan'])
plt.xticks(rotation=0)
plt.show()
No description has been provided for this image

Observations

  • We have imbalance in TARGET variable based on the % of observations
  • TARGET value 1 represents client with payment difficulties (he/she had late payment more than X days on at least one of the first Y installments of the loan). This is only 8.07% of the data
  • TARGET value 0 represents all other cases than 1. This is 91.93% of the data

Data split based on TARGET¶

Create new dataframe with TARGET value 1¶

  • TARGET value 1 represents client with payment difficulties (he/she had late payment more than X days on at least one of the first Y installments of the loan). This is only 8.07% of the data
In [187]:
df1 = df[df['TARGET'] == 1]
In [188]:
df1.TARGET.value_counts()
Out[188]:
1    24825
Name: TARGET, dtype: int64

Create new dataframe with TARGET value 0¶

  • TARGET value 0 represents all other cases than 1. This is 91.93% of the data
In [189]:
df0 = df[df['TARGET'] == 0]
In [190]:
df0.TARGET.value_counts()
Out[190]:
0    282686
Name: TARGET, dtype: int64

Univariate analysis of categorical variables¶

Analysis of NAME_CONTRACT_TYPE¶

NAME_CONTRACT_TYPE with payment difficulties

In [191]:
df1['NAME_CONTRACT_TYPE'].value_counts().sort_values(ascending = False)
Out[191]:
Cash loans         23221
Revolving loans     1604
Name: NAME_CONTRACT_TYPE, dtype: int64

NAME_CONTRACT_TYPE with on-time payments

In [192]:
df0['NAME_CONTRACT_TYPE'].value_counts().sort_values(ascending = False)
Out[192]:
Cash loans         255011
Revolving loans     27675
Name: NAME_CONTRACT_TYPE, dtype: int64
In [193]:
fn_uni_countplot('NAME_CONTRACT_TYPE')
No description has been provided for this image
In [194]:
fn_uni_piechart('NAME_CONTRACT_TYPE')
No description has been provided for this image

Observations

  • Looking at CountPlot, we don't see significant differences in NAME_CONTRACT_TYPE b/w clients with payment difficulties and on-time payments
  • Looking at PieChart, we don't see significant differences in NAME_CONTRACT_TYPE b/w clients with payment difficulties and on-time payments

Conclusion

  • NAME_CONTRACT_TYPE column does not provide any conclusive evidence in favor of clients with payment difficulties OR on-time payments

Analysis of CODE_GENDER¶

CODE_GENDER with payment difficulties

In [195]:
df1['CODE_GENDER'].value_counts().sort_values(ascending = False)
Out[195]:
F    14170
M    10655
Name: CODE_GENDER, dtype: int64

CODE_GENDER with on-time payments

In [196]:
df0['CODE_GENDER'].value_counts().sort_values(ascending = False)
Out[196]:
F    188282
M     94404
Name: CODE_GENDER, dtype: int64
In [197]:
sorted(df['CODE_GENDER'].unique(), reverse = True)
Out[197]:
['M', 'F']
In [198]:
fn_uni_countplot('CODE_GENDER')
No description has been provided for this image
In [199]:
fn_uni_piechart('CODE_GENDER')
No description has been provided for this image

Observations

  • Looking at CountPlot and PieChart, there is a 9.5% decrease in "Male" values from CODE_GENDER b/w clients with payment difficulties to on-time payments. It's a weak correlation that Males have more payment difficulties.

Conclusion

  • CODE_GENDER column provides a weak inference that "Male" clients have more payment difficulties

Analysis of FLAG_OWN_CAR¶

FLAG_OWN_CAR with payment difficulties

In [200]:
df1['FLAG_OWN_CAR'].value_counts().sort_values(ascending = False)
Out[200]:
N    17249
Y     7576
Name: FLAG_OWN_CAR, dtype: int64

FLAG_OWN_CAR with on-time payments

In [201]:
df0['FLAG_OWN_CAR'].value_counts().sort_values(ascending = False)
Out[201]:
N    185675
Y     97011
Name: FLAG_OWN_CAR, dtype: int64
In [202]:
fn_uni_countplot('FLAG_OWN_CAR')
No description has been provided for this image
In [203]:
fn_uni_piechart('FLAG_OWN_CAR')
No description has been provided for this image

Observations

  • Looking at CountPlot, we don't see significant differences in FLAG_OWN_CAR b/w clients with payment difficulties and on-time payments
  • Looking at PieChart, we don't see significant differences in FLAG_OWN_CAR b/w clients with payment difficulties and on-time payments.

Conclusion

  • FLAG_OWN_CAR column does not provide any conclusive evidence in favor of clients with payment difficulties OR on-time payments

Analysis of NAME_INCOME_TYPE¶

NAME_INCOME_TYPE with payment difficulties

In [204]:
df1['NAME_INCOME_TYPE'].value_counts().sort_values(ascending = False)
Out[204]:
Working                 15224
Commercial associate     5360
Pensioner                2982
State servant            1249
Unemployed                  8
Maternity leave             2
Name: NAME_INCOME_TYPE, dtype: int64

NAME_INCOME_TYPE with on-time payments

In [205]:
df0['NAME_INCOME_TYPE'].value_counts().sort_values(ascending = False)
Out[205]:
Working                 143550
Commercial associate     66257
Pensioner                52380
State servant            20454
Student                     18
Unemployed                  14
Businessman                 10
Maternity leave              3
Name: NAME_INCOME_TYPE, dtype: int64
In [206]:
fn_uni_countplot('NAME_INCOME_TYPE')
No description has been provided for this image
In [207]:
fn_uni_barplot('NAME_INCOME_TYPE')
No description has been provided for this image

Observations

  • Pensioners have better on-time payments
  • Students don't have Payment difficulties
  • Businessmen don't have Payment difficulties

Conclusion

  • Pensioners have better on-time payments. This is a weak correlation.
  • Students don't have Payment difficulties. In this case, total students have only 18 observations and should be treated as a weak correlation
  • Businessmen don't have Payment difficulties. In this case, Businessmen have only 10 observations and should be treated as a weak correlation

Analysis of NAME_EDUCATION_TYPE¶

NAME_EDUCATION_TYPE with payment difficulties

In [208]:
df1['NAME_EDUCATION_TYPE'].value_counts().sort_values(ascending = False)
Out[208]:
Secondary / secondary special    19524
Higher education                  4009
Incomplete higher                  872
Lower secondary                    417
Academic degree                      3
Name: NAME_EDUCATION_TYPE, dtype: int64

NAME_EDUCATION_TYPE with on-time payments

In [209]:
df0['NAME_EDUCATION_TYPE'].value_counts().sort_values(ascending = False)
Out[209]:
Secondary / secondary special    198867
Higher education                  70854
Incomplete higher                  9405
Lower secondary                    3399
Academic degree                     161
Name: NAME_EDUCATION_TYPE, dtype: int64
In [210]:
fn_uni_countplot('NAME_EDUCATION_TYPE')
No description has been provided for this image
In [211]:
fn_uni_barplot('NAME_EDUCATION_TYPE')
No description has been provided for this image
In [212]:
fn_uni_piechart('NAME_EDUCATION_TYPE')
No description has been provided for this image

Observations

  • Clients with 'Higher education' have better on-timepayments than payment difficulties
  • Remaining categories don't provide any conclusive results

Conclusion

  • Clients with 'Higher education' have less payment difficulties. However, this is a weak correlation

Analysis of NAME_FAMILY_STATUS¶

NAME_FAMILY_STATUS with payment difficulties

In [213]:
df1['NAME_FAMILY_STATUS'].value_counts().sort_values(ascending = False)
Out[213]:
Married                 14850
Single / not married     4457
Civil marriage           2961
Separated                1620
Widow                     937
Name: NAME_FAMILY_STATUS, dtype: int64

NAME_FAMILY_STATUS with on-time payments

In [214]:
df0['NAME_FAMILY_STATUS'].value_counts().sort_values(ascending = False)
Out[214]:
Married                 181584
Single / not married     40987
Civil marriage           26814
Separated                18150
Widow                    15151
Name: NAME_FAMILY_STATUS, dtype: int64
In [215]:
fn_uni_countplot('NAME_FAMILY_STATUS')
No description has been provided for this image
In [216]:
fn_uni_barplot('NAME_FAMILY_STATUS')
No description has been provided for this image
In [217]:
fn_uni_piechart('NAME_FAMILY_STATUS')
No description has been provided for this image

Observations

  • Clients who are 'Married' are 59.8% with payment difficulties and 64.2% with on-timepayments
  • Clients who are 'Widow' are 3.8% with payment difficulties and 5.4% with on-timepayments
  • Clients who are 'Single/not married' are 18.0% with payment difficulties and 14.5% with on-timepayments
  • Remaining categories don't provide any conclusive results

Conclusion

  • Clients who are 'Married' OR 'Widow' do on-time payments better comparatively. However, this is a weak correlation.
  • Clients who are 'Single/not married' have more difficulties with on-time payments comparatively. However, this is a weak correlation.

Correlation analysis of numerical variables¶

Plotting correlation matrix for Payment Difficulties¶

In [218]:
df1.select_dtypes(include=["int64","float64"]).shape
Out[218]:
(24825, 66)

There are 66 numerical columns. Creating a correlation matrix corr to view the results better

In [219]:
corr_df1 = df1.select_dtypes(include=["int64","float64"]).corr()
In [220]:
corr_df1.head()
Out[220]:
SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_2 EXT_SOURCE_3 OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR YEARS_BIRTH YEARS_EMPLOYED YEARS_REGISTRATION YEARS_ID_PUBLISH YEARS_LAST_PHONE_CHANGE
SK_ID_CURR 1.00 NaN -0.01 -0.01 -0.00 -0.01 -0.00 0.01 0.00 -0.01 -0.01 0.00 NaN 0.01 0.00 -0.00 0.02 0.00 -0.00 -0.01 -0.00 0.01 -0.00 0.00 0.00 0.01 0.00 -0.00 0.01 -0.01 -0.01 -0.01 -0.01 -0.01 -0.00 -0.00 -0.00 NaN 0.00 0.00 0.00 0.00 -0.01 NaN -0.00 NaN -0.00 0.00 0.01 -0.00 -0.01 -0.01 -0.01 0.00 0.01 -0.01 -0.01 -0.00 0.01 -0.00 0.01 0.00 -0.01 -0.01 0.00 -0.00
TARGET NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
CNT_CHILDREN -0.01 NaN 1.00 0.00 -0.00 0.03 -0.01 -0.03 -0.26 -0.19 -0.15 0.03 NaN 0.19 0.05 0.00 -0.02 0.01 0.89 0.04 0.04 -0.02 -0.02 -0.02 -0.01 -0.00 0.05 0.05 -0.01 -0.02 0.03 0.00 0.03 -0.01 -0.01 -0.01 0.06 NaN -0.02 -0.13 -0.00 0.03 0.00 NaN -0.01 NaN 0.00 -0.00 0.00 -0.00 -0.01 0.00 -0.00 -0.00 0.00 -0.00 -0.01 -0.01 -0.01 -0.02 -0.04 -0.26 -0.19 -0.15 0.03 -0.01
AMT_INCOME_TOTAL -0.01 NaN 0.00 1.00 0.04 0.05 0.04 0.01 -0.00 -0.01 -0.00 0.00 NaN 0.01 -0.01 -0.00 -0.00 0.01 0.01 -0.02 -0.02 0.01 0.01 0.01 0.01 -0.00 -0.00 -0.00 0.01 -0.02 -0.00 -0.01 -0.00 -0.00 0.00 0.00 0.00 NaN -0.00 -0.01 -0.00 0.02 0.00 NaN -0.00 NaN 0.01 0.00 0.00 0.00 -0.00 -0.00 -0.00 -0.00 -0.00 0.00 -0.00 0.00 0.00 -0.00 0.00 -0.00 -0.01 0.00 0.00 0.00
AMT_CREDIT -0.00 NaN -0.00 0.04 1.00 0.75 0.98 0.07 0.14 0.00 0.03 0.05 NaN -0.00 -0.06 0.03 0.01 0.01 0.05 -0.06 -0.07 0.03 0.02 0.03 0.03 -0.03 -0.04 -0.02 0.12 0.08 0.02 -0.03 0.02 -0.03 0.11 0.03 0.06 NaN -0.01 -0.01 -0.01 0.06 0.03 NaN 0.04 NaN 0.02 0.06 0.03 0.05 -0.00 0.01 0.01 0.02 -0.02 -0.01 0.00 0.01 0.06 -0.02 -0.04 0.14 0.00 0.03 0.05 0.11

Creating a HeatMap to view the correlations above 80% and 99.99%

In [221]:
plt.figure(figsize = (25,25))
sns.heatmap(data = corr_df1[(corr_df1 >= 0.8) & (corr_df1 < 0.9999)], annot = True, cmap = "RdYlGn", cbar = True, fmt='.2f')
plt.show()
No description has been provided for this image

Getting top 10 correlations for Payment Difficulties¶

In [222]:
corr_df1[corr_df1 <= 0.99].unstack().sort_values(ascending = False).head(22)
Out[222]:
AMT_GOODS_PRICE              AMT_CREDIT                    0.98
AMT_CREDIT                   AMT_GOODS_PRICE               0.98
REGION_RATING_CLIENT_W_CITY  REGION_RATING_CLIENT          0.96
REGION_RATING_CLIENT         REGION_RATING_CLIENT_W_CITY   0.96
CNT_FAM_MEMBERS              CNT_CHILDREN                  0.89
CNT_CHILDREN                 CNT_FAM_MEMBERS               0.89
DEF_60_CNT_SOCIAL_CIRCLE     DEF_30_CNT_SOCIAL_CIRCLE      0.87
DEF_30_CNT_SOCIAL_CIRCLE     DEF_60_CNT_SOCIAL_CIRCLE      0.87
LIVE_REGION_NOT_WORK_REGION  REG_REGION_NOT_WORK_REGION    0.85
REG_REGION_NOT_WORK_REGION   LIVE_REGION_NOT_WORK_REGION   0.85
LIVE_CITY_NOT_WORK_CITY      REG_CITY_NOT_WORK_CITY        0.78
REG_CITY_NOT_WORK_CITY       LIVE_CITY_NOT_WORK_CITY       0.78
AMT_ANNUITY                  AMT_GOODS_PRICE               0.75
AMT_GOODS_PRICE              AMT_ANNUITY                   0.75
AMT_ANNUITY                  AMT_CREDIT                    0.75
AMT_CREDIT                   AMT_ANNUITY                   0.75
FLAG_DOCUMENT_6              YEARS_EMPLOYED                0.62
YEARS_EMPLOYED               FLAG_DOCUMENT_6               0.62
DAYS_EMPLOYED                FLAG_DOCUMENT_6               0.62
FLAG_DOCUMENT_6              DAYS_EMPLOYED                 0.62
DAYS_EMPLOYED                DAYS_BIRTH                    0.58
DAYS_BIRTH                   DAYS_EMPLOYED                 0.58
dtype: float64

As we have duplicate combinations, looking at the above and removing dups, we get top 10 correlations as:

  • AMT_GOODS_PRICE AMT_CREDIT 0.98
  • REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY 0.96
  • CNT_FAM_MEMBERS CNT_CHILDREN 0.89
  • DEF_60_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE 0.87
  • REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION 0.85
  • LIVE_CITY_NOT_WORK_CITY REG_CITY_NOT_WORK_CITY 0.78
  • AMT_ANNUITY AMT_GOODS_PRICE 0.75
  • AMT_ANNUITY AMT_CREDIT 0.75
  • DAYS_EMPLOYED FLAG_DOCUMENT_6 0.62
  • DAYS_BIRTH DAYS_EMPLOYED 0.58

Plotting correlation matrix for On-Time payments¶

In [223]:
df0.select_dtypes(include=["int64","float64"]).shape
Out[223]:
(282686, 66)

There are 66 numerical columns. Creating a correlation matrix corr to view the results better

In [224]:
corr_df0 = df0.select_dtypes(include=["int64","float64"]).corr()
In [225]:
corr_df0.head()
Out[225]:
SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_2 EXT_SOURCE_3 OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR YEARS_BIRTH YEARS_EMPLOYED YEARS_REGISTRATION YEARS_ID_PUBLISH YEARS_LAST_PHONE_CHANGE
SK_ID_CURR 1.00 NaN -0.00 0.00 -0.00 0.00 -0.00 0.00 0.00 0.00 0.00 0.00 0.00 -0.00 -0.00 0.00 0.00 0.00 -0.00 -0.00 -0.00 -0.00 0.00 0.00 0.00 -0.00 -0.00 0.00 0.00 0.00 -0.00 0.00 -0.00 0.00 0.00 0.00 -0.00 -0.00 -0.00 0.00 -0.00 0.00 0.00 -0.00 -0.00 -0.00 0.00 -0.00 0.00 -0.00 0.00 0.00 0.00 0.00 -0.00 -0.00 -0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
TARGET NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
CNT_CHILDREN -0.00 NaN 1.00 0.03 0.00 0.02 -0.00 -0.02 -0.34 -0.25 -0.19 0.03 0.00 0.24 0.06 -0.00 -0.03 0.02 0.88 0.02 0.02 -0.01 -0.01 0.01 0.02 0.02 0.07 0.07 -0.02 -0.04 0.01 -0.00 0.01 -0.00 0.01 0.00 0.06 -0.00 -0.02 -0.16 -0.00 0.05 -0.00 -0.00 -0.01 0.00 0.00 -0.01 0.00 0.01 0.00 0.00 0.00 0.00 -0.00 -0.00 0.00 -0.00 -0.01 -0.01 -0.04 -0.34 -0.25 -0.19 0.03 0.01
AMT_INCOME_TOTAL 0.00 NaN 0.03 1.00 0.34 0.42 0.35 0.17 -0.06 -0.14 -0.06 -0.02 0.00 0.14 -0.03 -0.02 0.00 0.08 0.03 -0.19 -0.20 0.08 0.07 0.14 0.13 0.01 0.02 0.02 0.14 -0.07 -0.03 -0.03 -0.03 -0.03 0.04 -0.00 -0.04 0.00 0.00 -0.10 0.01 0.16 0.04 0.00 0.01 0.01 0.05 0.05 0.02 0.02 0.01 0.01 0.01 0.00 -0.00 0.00 0.01 0.01 0.06 0.01 0.03 -0.06 -0.14 -0.06 -0.02 0.04
AMT_CREDIT -0.00 NaN 0.00 0.34 1.00 0.77 0.99 0.10 0.05 -0.07 -0.01 0.00 0.00 0.07 -0.02 0.02 0.03 0.02 0.06 -0.10 -0.11 0.05 0.02 0.05 0.05 -0.03 -0.02 0.00 0.13 0.04 -0.00 -0.02 -0.00 -0.02 0.07 0.01 0.10 0.00 -0.01 -0.05 -0.00 0.08 0.02 -0.00 0.03 0.00 0.05 0.05 0.03 0.06 0.01 0.04 0.02 0.03 -0.02 -0.00 0.00 -0.00 0.05 0.02 -0.05 0.05 -0.07 -0.01 0.00 0.07

Creating a HeatMap to view the correlations above 80% and 99.99%

In [226]:
plt.figure(figsize = (25,25))
sns.heatmap(data = corr_df0[(corr_df0 >= 0.8) & (corr_df0 < 0.9999)], annot = True, cmap = "RdYlGn", cbar = True, fmt='.2f')
plt.show()
No description has been provided for this image

Getting top 10 correlations for On-Time Payments¶

In [227]:
corr_df0[corr_df0 <= 0.99].unstack().sort_values(ascending = False).head(28)
Out[227]:
AMT_CREDIT                   AMT_GOODS_PRICE               0.99
AMT_GOODS_PRICE              AMT_CREDIT                    0.99
REGION_RATING_CLIENT_W_CITY  REGION_RATING_CLIENT          0.95
REGION_RATING_CLIENT         REGION_RATING_CLIENT_W_CITY   0.95
CNT_CHILDREN                 CNT_FAM_MEMBERS               0.88
CNT_FAM_MEMBERS              CNT_CHILDREN                  0.88
REG_REGION_NOT_WORK_REGION   LIVE_REGION_NOT_WORK_REGION   0.86
LIVE_REGION_NOT_WORK_REGION  REG_REGION_NOT_WORK_REGION    0.86
DEF_60_CNT_SOCIAL_CIRCLE     DEF_30_CNT_SOCIAL_CIRCLE      0.86
DEF_30_CNT_SOCIAL_CIRCLE     DEF_60_CNT_SOCIAL_CIRCLE      0.86
REG_CITY_NOT_WORK_CITY       LIVE_CITY_NOT_WORK_CITY       0.83
LIVE_CITY_NOT_WORK_CITY      REG_CITY_NOT_WORK_CITY        0.83
AMT_ANNUITY                  AMT_GOODS_PRICE               0.78
AMT_GOODS_PRICE              AMT_ANNUITY                   0.78
AMT_CREDIT                   AMT_ANNUITY                   0.77
AMT_ANNUITY                  AMT_CREDIT                    0.77
DAYS_EMPLOYED                DAYS_BIRTH                    0.63
DAYS_BIRTH                   DAYS_EMPLOYED                 0.63
                             YEARS_EMPLOYED                0.63
YEARS_EMPLOYED               DAYS_BIRTH                    0.63
DAYS_EMPLOYED                YEARS_BIRTH                   0.63
YEARS_BIRTH                  DAYS_EMPLOYED                 0.63
YEARS_EMPLOYED               YEARS_BIRTH                   0.63
YEARS_BIRTH                  YEARS_EMPLOYED                0.63
FLAG_DOCUMENT_6              YEARS_EMPLOYED                0.60
YEARS_EMPLOYED               FLAG_DOCUMENT_6               0.60
FLAG_DOCUMENT_6              DAYS_EMPLOYED                 0.60
DAYS_EMPLOYED                FLAG_DOCUMENT_6               0.60
dtype: float64

As we have duplicate combinations, looking at the above and removing dups, we get top 10 correlations as:

  • AMT_GOODS_PRICE AMT_CREDIT 0.99
  • REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY 0.95
  • CNT_FAM_MEMBERS CNT_CHILDREN 0.88
  • REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION 0.86
  • DEF_30_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE 0.86
  • LIVE_CITY_NOT_WORK_CITY REG_CITY_NOT_WORK_CITY 0.83
  • AMT_ANNUITY AMT_GOODS_PRICE 0.78
  • AMT_ANNUITY AMT_CREDIT 0.77
  • DAYS_BIRTH DAYS_EMPLOYED 0.63
  • DAYS_EMPLOYED FLAG_DOCUMENT_6 0.60

Comparing top 10 correlations b/w Payment difficulties and On-Time Payments¶

Observations

  • The top 10 correlations for Payment difficulties and On-Time Payments are the same except minor differences in correlation percentage
  • The highest correlation is for the combination of AMT_GOODS_PRICE and AMT_CREDIT
  • For Payment difficulties dataset, the correlation b/w AMT_GOODS_PRICE and AMT_CREDIT is 0.98
  • For On-Time Payments dataset, the correlation b/w AMT_GOODS_PRICE and AMT_CREDIT is 0.99

Univariate analysis of numerical variables¶

Analysis of AMT_CREDIT¶

Outlier identification of AMT_CREDIT with Payment difficulties¶

Calculating IQR (Inter Quartile range)

In [228]:
Q1 = df1['AMT_CREDIT'].quantile(0.25)
Q3 = df1['AMT_CREDIT'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
448915.5

Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers

In [229]:
Min_value1 = (Q1 - 1.5 * IQR)
Max_value1 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value1))
print("Max value after which outlier exist: {}".format(Max_value1))
Min value before which outlier exist: -388973.25
Max value after which outlier exist: 1406688.75

Outlier identification of AMT_CREDIT with On-Time Payments¶

Calculating IQR (Inter Quartile range)

In [230]:
Q1 = df0['AMT_CREDIT'].quantile(0.25)
Q3 = df0['AMT_CREDIT'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
540000.0

Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers

In [231]:
Min_value0 = (Q1 - 1.5 * IQR)
Max_value0 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value0))
print("Max value after which outlier exist: {}".format(Max_value0))
Min value before which outlier exist: -540000.0
Max value after which outlier exist: 1620000.0

Removing outliers and plotting distplot

In [232]:
plt.figure(figsize = [20,8])
sns.distplot(df1[df1['AMT_CREDIT'] <= Max_value1].AMT_CREDIT,label = 'Payment difficulties', hist=False)
sns.distplot(df0[df0['AMT_CREDIT'] <= Max_value0].AMT_CREDIT,label = 'On-Time Payments', hist=False)
plt.ticklabel_format(style='plain', axis='x')
plt.xticks(rotation = 45)
plt.legend()
plt.show()
No description has been provided for this image

Observations

  • For AMT_CREDIT between 250000 and approximately 650000, there are more clients with Payment difficulties
  • For AMT_CREDIT > 750000 , there are more clients with On-Time Payments

Analysis of YEARS_BIRTH¶

Outlier identification of YEARS_BIRTH with Payment difficulties¶

Calculating IQR (Inter Quartile range)

In [233]:
Q1 = df1['YEARS_BIRTH'].quantile(0.25)
Q3 = df1['YEARS_BIRTH'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
18.0

Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers

In [234]:
Min_value1 = (Q1 - 1.5 * IQR)
Max_value1 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value1))
print("Max value after which outlier exist: {}".format(Max_value1))
Min value before which outlier exist: 4.0
Max value after which outlier exist: 76.0

Outlier identification of YEARS_BIRTH with On-Time Payments¶

Calculating IQR (Inter Quartile range)

In [235]:
Q1 = df0['YEARS_BIRTH'].quantile(0.25)
Q3 = df0['YEARS_BIRTH'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
20.0

Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers

In [236]:
Min_value0 = (Q1 - 1.5 * IQR)
Max_value0 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value0))
print("Max value after which outlier exist: {}".format(Max_value0))
Min value before which outlier exist: 4.0
Max value after which outlier exist: 84.0

Removing outliers and plotting distplot

In [237]:
plt.figure(figsize = [20,8])
sns.distplot(df1[df1['YEARS_BIRTH'] <= Max_value1].YEARS_BIRTH,label = 'Payment difficulties', hist=False)
sns.distplot(df0[df0['YEARS_BIRTH'] <= Max_value0].YEARS_BIRTH,label = 'On-Time Payments', hist=False)
plt.ticklabel_format(style='plain', axis='x')
plt.xticks(rotation = 45)
plt.legend()
plt.show()
No description has been provided for this image

Observations

  • For YEARS_BIRTH between 20 and 40, there are more clients with Payment difficulties
  • Conversely, for YEARS_BIRTH > 40 , there are more clients with On-Time Payments

Analysis of AMT_GOODS_PRICE ¶

Outlier identification of AMT_GOODS_PRICE with Payment difficulties¶

Calculating IQR (Inter Quartile range)

In [238]:
Q1 = df1['AMT_GOODS_PRICE'].quantile(0.25)
Q3 = df1['AMT_GOODS_PRICE'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
436500.0

Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers

In [239]:
Min_value1 = (Q1 - 1.5 * IQR)
Max_value1 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value1))
print("Max value after which outlier exist: {}".format(Max_value1))
Min value before which outlier exist: -416250.0
Max value after which outlier exist: 1329750.0

Outlier identification of AMT_GOODS_PRICE with On-Time Payments¶

Calculating IQR (Inter Quartile range)

In [240]:
Q1 = df0['AMT_GOODS_PRICE'].quantile(0.25)
Q3 = df0['AMT_GOODS_PRICE'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
450000.0

Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers

In [241]:
Min_value0 = (Q1 - 1.5 * IQR)
Max_value0 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value0))
print("Max value after which outlier exist: {}".format(Max_value0))
Min value before which outlier exist: -436500.0
Max value after which outlier exist: 1363500.0

Removing outliers and plotting distplot

In [242]:
plt.figure(figsize = [20,8])
sns.distplot(df1[df1['AMT_GOODS_PRICE'] <= Max_value1].AMT_GOODS_PRICE,label = 'Payment difficulties', hist=False)
sns.distplot(df0[df0['AMT_GOODS_PRICE'] <= Max_value0].AMT_GOODS_PRICE,label = 'On-Time Payments', hist=False)
plt.ticklabel_format(style='plain', axis='x')
plt.xticks(rotation = 45)
plt.legend()
plt.show()
No description has been provided for this image

Observations

  • For AMT_GOODS_PRICE between ~250000 and ~550000, there are more clients with Payment difficulties
  • Otherwise there are spikes on and off but they don't show any conclusive observations

Analysis of DAYS_EMPLOYED¶

Outlier identification of DAYS_EMPLOYED with Payment difficulties¶

Calculating IQR (Inter Quartile range)

In [243]:
Q1 = df1['DAYS_EMPLOYED'].quantile(0.25)
Q3 = df1['DAYS_EMPLOYED'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
2603.0

Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers

In [244]:
Min_value1 = (Q1 - 1.5 * IQR)
Max_value1 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value1))
print("Max value after which outlier exist: {}".format(Max_value1))
Min value before which outlier exist: -3227.5
Max value after which outlier exist: 7184.5

Outlier identification of DAYS_EMPLOYED with On-Time Payments¶

Calculating IQR (Inter Quartile range)

In [245]:
Q1 = df0['DAYS_EMPLOYED'].quantile(0.25)
Q3 = df0['DAYS_EMPLOYED'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
5107.0

Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers

In [246]:
Min_value0 = (Q1 - 1.5 * IQR)
Max_value0 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value0))
print("Max value after which outlier exist: {}".format(Max_value0))
Min value before which outlier exist: -6693.5
Max value after which outlier exist: 13734.5

Removing outliers and plotting distplot

In [247]:
plt.figure(figsize = [20,8])
sns.distplot(df1[df1['DAYS_EMPLOYED'] <= Max_value1].DAYS_EMPLOYED,label = 'Payment difficulties', hist=False)
sns.distplot(df0[df0['DAYS_EMPLOYED'] <= Max_value0].DAYS_EMPLOYED,label = 'On-Time Payments', hist=False)
plt.ticklabel_format(style='plain', axis='x')
plt.xticks(rotation = 45)
plt.legend()
plt.show()
No description has been provided for this image

Observations

  • For DAYS_EMPLOYED less than 2000, there are more clients with Payment difficulties
  • Conversely, for DAYS_EMPLOYED > 2000 , there are more clients with On-Time Payments
  • This means that those who are employed longer have better chances of repaying the loan

Analysis of CNT_CHILDREN ¶

Outlier identification of CNT_CHILDREN with Payment difficulties¶

Calculating IQR (Inter Quartile range)

In [248]:
Q1 = df1['CNT_CHILDREN'].quantile(0.25)
Q3 = df1['CNT_CHILDREN'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
1.0

Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers

In [249]:
Min_value1 = (Q1 - 1.5 * IQR)
Max_value1 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value1))
print("Max value after which outlier exist: {}".format(Max_value1))
Min value before which outlier exist: -1.5
Max value after which outlier exist: 2.5

Outlier identification of CNT_CHILDREN with On-Time Payments¶

Calculating IQR (Inter Quartile range)

In [250]:
Q1 = df0['CNT_CHILDREN'].quantile(0.25)
Q3 = df0['CNT_CHILDREN'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
1.0

Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers

In [251]:
Min_value0 = (Q1 - 1.5 * IQR)
Max_value0 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value0))
print("Max value after which outlier exist: {}".format(Max_value0))
Min value before which outlier exist: -1.5
Max value after which outlier exist: 2.5

Removing outliers and plotting distplot

In [252]:
plt.figure(figsize = [20,8])
sns.distplot(df1[df1['CNT_CHILDREN'] <= Max_value1].CNT_CHILDREN,label = 'Payment difficulties', hist=False)
sns.distplot(df0[df0['CNT_CHILDREN'] <= Max_value0].CNT_CHILDREN,label = 'On-Time Payments', hist=False)
plt.ticklabel_format(style='plain', axis='x')
plt.xticks(rotation = 45)
plt.legend()
plt.show()
No description has been provided for this image

Observations

  • For CNT_CHILDREN 0 (those with no children), there are lots of clients with On-Time Payments
  • For CNT_CHILDREN with 1 OR 2 (those with 1 or 2 children), there are few more clients with On-Time Payments

Analysis of AMT_INCOME_TOTAL ¶

Outlier identification of AMT_INCOME_TOTAL with Payment difficulties¶

Calculating IQR (Inter Quartile range)

In [253]:
Q1 = df1['AMT_INCOME_TOTAL'].quantile(0.25)
Q3 = df1['AMT_INCOME_TOTAL'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
90000.0

Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers

In [254]:
Min_value1 = (Q1 - 1.5 * IQR)
Max_value1 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value1))
print("Max value after which outlier exist: {}".format(Max_value1))
Min value before which outlier exist: -22500.0
Max value after which outlier exist: 337500.0

Outlier identification of AMT_INCOME_TOTAL with On-Time Payments¶

Calculating IQR (Inter Quartile range)

In [255]:
Q1 = df0['AMT_INCOME_TOTAL'].quantile(0.25)
Q3 = df0['AMT_INCOME_TOTAL'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
90000.0

Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers

In [256]:
Min_value0 = (Q1 - 1.5 * IQR)
Max_value0 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value0))
print("Max value after which outlier exist: {}".format(Max_value0))
Min value before which outlier exist: -22500.0
Max value after which outlier exist: 337500.0

Removing outliers and plotting distplot

In [257]:
plt.figure(figsize = [20,8])
sns.distplot(df1[df1['AMT_INCOME_TOTAL'] <= Max_value1].AMT_INCOME_TOTAL,label = 'Payment difficulties', hist=False)
sns.distplot(df0[df0['AMT_INCOME_TOTAL'] <= Max_value0].AMT_INCOME_TOTAL,label = 'On-Time Payments', hist=False)
plt.ticklabel_format(style='plain', axis='x')
plt.xticks(rotation = 45)
plt.legend()
plt.show()
No description has been provided for this image

Observations

  • Based on AMT_INCOME_TOTAL, for clients with Payment difficulties, the distribution resembles a normal distribution approximately
  • But for clients with On-Time Payments, there are erratic spikes in the distribution which doesn't give any valid observations

Bivariate/Multivariate analysis¶

Continuous V/S Continuous variables¶

Analysis of AMT_GOODS_PRICE V/S AMT_CREDIT¶

Outlier identification of AMT_GOODS_PRICE with Payment difficulties

In [258]:
max_value1_AMT_GOODS_PRICE = outlier_range(df1,'AMT_GOODS_PRICE')
max_value1_AMT_GOODS_PRICE
Out[258]:
1329750.0

Outlier identification of AMT_CREDIT with Payment difficulties

In [259]:
max_value1_AMT_CREDIT = outlier_range(df1,'AMT_CREDIT')
max_value1_AMT_CREDIT
Out[259]:
1406688.75

Outlier identification of AMT_GOODS_PRICE with On-Time Payments

In [260]:
max_value0_AMT_GOODS_PRICE = outlier_range(df0,'AMT_GOODS_PRICE')
max_value0_AMT_GOODS_PRICE
Out[260]:
1363500.0

Outlier identification of AMT_CREDIT with On-Time Payments

In [261]:
max_value0_AMT_CREDIT = outlier_range(df0,'AMT_CREDIT')
max_value0_AMT_CREDIT
Out[261]:
1620000.0

Plotting scatterplot for comparison with outliers removed

In [262]:
plt.figure(figsize = [20,8])

plt.subplot(1,2,1)
plt.title('Payment difficulties')
sns.scatterplot(x = df1[df1['AMT_GOODS_PRICE'] < max_value1_AMT_GOODS_PRICE].AMT_GOODS_PRICE, 
                y = df1[df1['AMT_CREDIT'] < max_value1_AMT_CREDIT].AMT_CREDIT, data = df1)
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')

plt.subplot(1,2,2)
plt.title('On-Time Payments')
sns.scatterplot(x = df0[df0['AMT_GOODS_PRICE'] < max_value0_AMT_GOODS_PRICE].AMT_GOODS_PRICE, 
                y = df0[df0['AMT_CREDIT'] < max_value0_AMT_CREDIT].AMT_CREDIT, data = df0)
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')

plt.tight_layout(pad = 4)
plt.show()
No description has been provided for this image

Observations

  • AMT_GOODS_PRICE and AMT_CREDIT have strong positive correlation. This means that as Goods price increases, so does Credit Amount

Analysis of AMT_ANNUITY V/S AMT_CREDIT¶

Outlier identification of AMT_ANNUITY with Payment difficulties

In [263]:
max_value1_AMT_ANNUITY = outlier_range(df1,'AMT_ANNUITY')
max_value1_AMT_ANNUITY
Out[263]:
56398.5

Outlier identification of AMT_CREDIT with Payment difficulties

In [264]:
max_value1_AMT_CREDIT = outlier_range(df1,'AMT_CREDIT')
max_value1_AMT_CREDIT
Out[264]:
1406688.75

Outlier identification of AMT_ANNUITY with On-Time Payments

In [265]:
max_value0_AMT_ANNUITY = outlier_range(df0,'AMT_ANNUITY')
max_value0_AMT_ANNUITY
Out[265]:
62187.75

Outlier identification of AMT_CREDIT with On-Time Payments

In [266]:
max_value0_AMT_CREDIT = outlier_range(df0,'AMT_CREDIT')
max_value0_AMT_CREDIT
Out[266]:
1620000.0

Plotting scatterplot for comparison with outliers removed

In [267]:
plt.figure(figsize = [20,8])

plt.subplot(1,2,1)
plt.title('Payment difficulties')
sns.scatterplot(x = df1[df1['AMT_ANNUITY'] < max_value1_AMT_ANNUITY].AMT_ANNUITY, 
                y = df1[df1['AMT_CREDIT'] < max_value1_AMT_CREDIT].AMT_CREDIT, data = df1)
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')

plt.subplot(1,2,2)
plt.title('On-Time Payments')
sns.scatterplot(x = df0[df0['AMT_ANNUITY'] < max_value0_AMT_ANNUITY].AMT_ANNUITY, 
                y = df0[df0['AMT_CREDIT'] < max_value0_AMT_CREDIT].AMT_CREDIT, data = df0)
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')

plt.tight_layout(pad = 4)
plt.show()
No description has been provided for this image

Observations

  • AMT_ANNUITY and AMT_CREDIT have strong positive correlation. This means that as Annuity Amount increases, so does Credit Amount

Analysis of DAYS_EMPLOYED V/S AMT_INCOME_TOTAL¶

Outlier identification of DAYS_EMPLOYED with Payment difficulties

In [268]:
max_value1_DAYS_EMPLOYED = outlier_range(df1,'DAYS_EMPLOYED')
max_value1_DAYS_EMPLOYED
Out[268]:
7184.5

Outlier identification of AMT_INCOME_TOTAL with Payment difficulties

In [269]:
max_value1_AMT_INCOME_TOTAL = outlier_range(df1,'AMT_INCOME_TOTAL')
max_value1_AMT_INCOME_TOTAL
Out[269]:
337500.0

Outlier identification of DAYS_EMPLOYED with On-Time Payments

In [270]:
max_value0_DAYS_EMPLOYED = outlier_range(df0,'DAYS_EMPLOYED')
max_value0_DAYS_EMPLOYED
Out[270]:
13734.5

Outlier identification of AMT_INCOME_TOTAL with On-Time Payments

In [271]:
max_value0_AMT_INCOME_TOTAL = outlier_range(df0,'AMT_INCOME_TOTAL')
max_value0_AMT_INCOME_TOTAL
Out[271]:
337500.0

Plotting scatterplot for comparison with outliers removed

In [272]:
plt.figure(figsize = [20,8])

plt.subplot(1,2,1)
plt.title('Payment difficulties')
sns.scatterplot(x = df1[df1['DAYS_EMPLOYED'] < max_value1_DAYS_EMPLOYED].DAYS_EMPLOYED, 
                y = df1[df1['AMT_INCOME_TOTAL'] < max_value1_AMT_INCOME_TOTAL].AMT_INCOME_TOTAL, data = df1)
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')

plt.subplot(1,2,2)
plt.title('On-Time Payments')
sns.scatterplot(x = df0[df0['DAYS_EMPLOYED'] < max_value0_DAYS_EMPLOYED].DAYS_EMPLOYED, 
                y = df0[df0['AMT_INCOME_TOTAL'] < max_value0_AMT_INCOME_TOTAL].AMT_INCOME_TOTAL, data = df0)
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')

plt.tight_layout(pad = 4)
plt.show()
No description has been provided for this image

Observations

  • Clients who are employed for a long time (>7000) days are making their payments on-time but these category of clients do not exist in Payments difficulties group
  • Even looking at Payment difficulties group, clients with more than 4000 days of employment are sparse

Analysis of AMT_CREDIT V/S DAYS_BIRTH¶

Outlier identification of AMT_CREDIT with Payment difficulties

In [273]:
max_value1_AMT_CREDIT = outlier_range(df1,'AMT_CREDIT')
max_value1_AMT_CREDIT
Out[273]:
1406688.75

Outlier identification of DAYS_BIRTH with Payment difficulties

In [274]:
max_value1_DAYS_BIRTH = outlier_range(df1,'DAYS_BIRTH')
max_value1_DAYS_BIRTH
Out[274]:
27998.5

Outlier identification of AMT_CREDIT with On-Time Payments

In [275]:
max_value0_AMT_CREDIT = outlier_range(df0,'AMT_CREDIT')
max_value0_AMT_CREDIT
Out[275]:
1620000.0

Outlier identification of DAYS_BIRTH with On-Time Payments

In [276]:
max_value0_DAYS_BIRTH = outlier_range(df0,'DAYS_BIRTH')
max_value0_DAYS_BIRTH
Out[276]:
30680.375

Plotting scatterplot for comparison with outliers removed

In [277]:
plt.figure(figsize = [20,8])

plt.subplot(1,2,1)
plt.title('Payment difficulties')
sns.scatterplot(x = df1[df1['AMT_CREDIT'] < max_value1_AMT_CREDIT].AMT_CREDIT, 
                y = df1[df1['DAYS_BIRTH'] < max_value1_DAYS_BIRTH].DAYS_BIRTH, data = df1)
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')

plt.subplot(1,2,2)
plt.title('On-Time Payments')
sns.scatterplot(x = df0[df0['AMT_CREDIT'] < max_value0_AMT_CREDIT].AMT_CREDIT, 
                y = df0[df0['DAYS_BIRTH'] < max_value0_DAYS_BIRTH].DAYS_BIRTH, data = df0)
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')

plt.tight_layout(pad = 4)
plt.show()
No description has been provided for this image

Observations

  • There is no observable correlation between Days of Birth and Amount of Credit

Analysis of AMT_ANNUITY V/S AMT_GOODS_PRICE¶

Outlier identification of AMT_ANNUITY with Payment difficulties

In [278]:
max_value1_AMT_ANNUITY = outlier_range(df1,'AMT_ANNUITY')
max_value1_AMT_ANNUITY
Out[278]:
56398.5

Outlier identification of AMT_GOODS_PRICE with Payment difficulties

In [279]:
max_value1_AMT_GOODS_PRICE = outlier_range(df1,'AMT_GOODS_PRICE')
max_value1_AMT_GOODS_PRICE
Out[279]:
1329750.0

Outlier identification of AMT_ANNUITY with On-Time Payments

In [280]:
max_value0_AMT_ANNUITY = outlier_range(df0,'AMT_ANNUITY')
max_value0_AMT_ANNUITY
Out[280]:
62187.75

Outlier identification of AMT_GOODS_PRICE with On-Time Payments

In [281]:
max_value0_AMT_GOODS_PRICE = outlier_range(df0,'AMT_GOODS_PRICE')
max_value0_AMT_GOODS_PRICE
Out[281]:
1363500.0

Plotting scatterplot for comparison with outliers removed

In [282]:
plt.figure(figsize = [20,8])

plt.subplot(1,2,1)
plt.title('Payment difficulties')
sns.scatterplot(x = df1[df1['AMT_ANNUITY'] < max_value1_AMT_ANNUITY].AMT_ANNUITY, 
                y = df1[df1['AMT_GOODS_PRICE'] < max_value1_AMT_GOODS_PRICE].AMT_GOODS_PRICE, data = df1)
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')

plt.subplot(1,2,2)
plt.title('On-Time Payments')
sns.scatterplot(x = df0[df0['AMT_ANNUITY'] < max_value0_AMT_ANNUITY].AMT_ANNUITY, 
                y = df0[df0['AMT_GOODS_PRICE'] < max_value0_AMT_GOODS_PRICE].AMT_GOODS_PRICE, data = df0)
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')

plt.tight_layout(pad = 4)
plt.show()
No description has been provided for this image

Observations

  • AMT_ANNUITY and AMT_GOODS_PRICE have strong positive correlation. This means that as Annuity increases, so does Goods Price

Continuous V/S Categorical variables¶

Analysis of NAME_EDUCATION_TYPE V/S AMT_CREDIT V/S CODE_GENDER¶

Outlier identification of AMT_CREDIT with Payment difficulties

In [283]:
max_value1_AMT_CREDIT = outlier_range(df1,'AMT_CREDIT')
max_value1_AMT_CREDIT
Out[283]:
1406688.75

Outlier identification of AMT_CREDIT with On-Time Payments

In [284]:
max_value0_AMT_CREDIT = outlier_range(df0,'AMT_CREDIT')
max_value0_AMT_CREDIT
Out[284]:
1620000.0

Client with Payment difficulties

In [285]:
df1.groupby(by = ['NAME_EDUCATION_TYPE','CODE_GENDER']).AMT_CREDIT.describe().head()
Out[285]:
count mean std min 25% 50% 75% max
NAME_EDUCATION_TYPE CODE_GENDER
Academic degree F 3.00 950245.50 504711.38 544491.00 667660.50 790830.00 1153122.75 1515415.50
Higher education F 2438.00 648500.58 410415.70 47970.00 323595.00 545040.00 877500.00 2687355.00
M 1571.00 642070.94 409695.93 45000.00 329202.00 545040.00 832797.00 2695500.00
Incomplete higher F 504.00 520150.35 352227.49 91692.00 269550.00 450000.00 675000.00 2695500.00
M 368.00 554723.82 347843.71 74628.00 284400.00 497520.00 699976.12 2013840.00

Client with On-Time Payments

In [286]:
df0.groupby(by = ['NAME_EDUCATION_TYPE','CODE_GENDER']).AMT_CREDIT.describe().head()
Out[286]:
count mean std min 25% 50% 75% max
NAME_EDUCATION_TYPE CODE_GENDER
Academic degree F 97.00 708770.74 463052.42 130824.00 354469.50 607500.00 1006920.00 2463840.00
M 64.00 735235.38 497739.97 74182.50 309519.00 610056.00 1036890.00 1984500.00
Higher education F 47852.00 682485.00 456642.78 45000.00 302206.50 568057.50 916470.00 4050000.00
M 23002.00 713144.54 486073.66 45000.00 315000.00 592560.00 973503.00 4050000.00
Incomplete higher F 5780.00 555340.53 385980.60 45000.00 253737.00 454500.00 760225.50 2517300.00
In [287]:
fn_bi_boxplot('NAME_EDUCATION_TYPE','AMT_CREDIT',max_value1_AMT_CREDIT,max_value0_AMT_CREDIT,'CODE_GENDER')
No description has been provided for this image

Observations

  • Clients with Academic Degree have a wide range of credits for On-Time Payments whereas the range is much lower for ones with Payment difficulties
  • Looking at summary statistics, Clients with Academic Degree and Payment difficulties take mean and median credit at a much higher range than On-Time Payment clients
  • Male clients with Academic Degree always pay the loan on-time

Analysis of NAME_FAMILY_STATUS V/S AMT_INCOME_TOTAL V/S CODE_GENDER¶

Outlier identification of AMT_INCOME_TOTAL with Payment difficulties

In [288]:
max_value1_AMT_INCOME_TOTAL = outlier_range(df1,'AMT_INCOME_TOTAL')
max_value1_AMT_INCOME_TOTAL
Out[288]:
337500.0

Outlier identification of AMT_INCOME_TOTAL with On-Time Payments

In [289]:
max_value0_AMT_INCOME_TOTAL = outlier_range(df0,'AMT_INCOME_TOTAL')
max_value0_AMT_INCOME_TOTAL
Out[289]:
337500.0

Client with Payment difficulties

In [290]:
df1.groupby(by = ['NAME_FAMILY_STATUS','CODE_GENDER']).AMT_INCOME_TOTAL.describe().head()
Out[290]:
count mean std min 25% 50% 75% max
NAME_FAMILY_STATUS CODE_GENDER
Civil marriage F 1753.00 153620.72 101045.86 25650.00 108000.00 135000.00 180000.00 3150000.00
M 1208.00 173949.99 92449.17 36000.00 117000.00 157500.00 202500.00 1350000.00
Married F 8324.00 159907.85 1283283.40 27000.00 90000.00 135000.00 180000.00 117000000.00
M 6526.00 182233.06 94787.77 31500.00 126000.00 157500.00 225000.00 1890000.00
Separated F 1045.00 161159.55 82633.33 36000.00 112500.00 135000.00 202500.00 832500.00

Client with On-Time Payments

In [291]:
df0.groupby(by = ['NAME_FAMILY_STATUS','CODE_GENDER']).AMT_INCOME_TOTAL.describe().head()
Out[291]:
count mean std min 25% 50% 75% max
NAME_FAMILY_STATUS CODE_GENDER
Civil marriage F 19017.00 162086.05 106811.49 27000.00 112500.00 135000.00 202500.00 9000000.00
M 7797.00 184126.29 106559.49 27000.00 121500.00 157500.00 225000.00 3825000.00
Married F 114125.00 154289.53 88593.43 26100.00 94500.00 135000.00 180000.00 3600000.00
M 67459.00 199592.83 147599.76 27000.00 135000.00 180000.00 225000.00 18000090.00
Separated F 14416.00 166374.21 94782.18 27000.00 112500.00 139500.00 202500.00 2250000.00
In [292]:
fn_bi_boxplot('NAME_FAMILY_STATUS','AMT_INCOME_TOTAL',max_value1_AMT_INCOME_TOTAL,max_value0_AMT_INCOME_TOTAL,'CODE_GENDER')
No description has been provided for this image

Observations

  • Married clients have a slightly higher mean/median income with On-Time Payments than Payment difficulties category

Analysis of YEARS_BIRTH_CATEGORY V/S AMT_INCOME_TOTAL V/S NAME_HOUSING_TYPE¶

Outlier identification of AMT_INCOME_TOTAL with Payment difficulties

In [293]:
max_value1_AMT_INCOME_TOTAL = outlier_range(df1,'AMT_INCOME_TOTAL')
max_value1_AMT_INCOME_TOTAL
Out[293]:
337500.0

Outlier identification of AMT_INCOME_TOTAL with On-Time Payments

In [294]:
max_value0_AMT_INCOME_TOTAL = outlier_range(df0,'AMT_INCOME_TOTAL')
max_value0_AMT_INCOME_TOTAL
Out[294]:
337500.0

Client with Payment difficulties

In [295]:
df1.groupby(by = ['YEARS_BIRTH_CATEGORY','NAME_HOUSING_TYPE']).AMT_INCOME_TOTAL.describe().head()
Out[295]:
count mean std min 25% 50% 75% max
YEARS_BIRTH_CATEGORY NAME_HOUSING_TYPE
20-30 Co-op apartment 40.00 148628.70 59516.24 67500.00 110250.00 135000.00 185625.00 360000.00
House / apartment 4201.00 151897.22 70874.65 31500.00 112500.00 135000.00 180000.00 810000.00
Municipal apartment 153.00 153608.82 78282.79 45000.00 108000.00 135000.00 180000.00 675000.00
Office apartment 40.00 147375.00 50817.26 67500.00 112500.00 135000.00 180000.00 270000.00
Rented apartment 229.00 144210.10 59998.66 45000.00 99000.00 135000.00 180000.00 405000.00

Client with On-Time Payments

In [296]:
df0.groupby(by = ['YEARS_BIRTH_CATEGORY','NAME_HOUSING_TYPE']).AMT_INCOME_TOTAL.describe().head()
Out[296]:
count mean std min 25% 50% 75% max
YEARS_BIRTH_CATEGORY NAME_HOUSING_TYPE
20-30 Co-op apartment 298.00 161153.15 76171.26 45000.00 112500.00 139500.00 202500.00 720000.00
House / apartment 33165.00 163944.39 92313.18 27000.00 112500.00 141264.00 202500.00 4500000.00
Municipal apartment 1099.00 166235.08 88698.47 45000.00 112500.00 144000.00 202500.00 1417500.00
Office apartment 455.00 177960.18 115765.13 36000.00 112500.00 157500.00 225000.00 1800000.00
Rented apartment 1535.00 153028.06 70748.99 29250.00 109125.00 135000.00 180000.00 742500.00
In [297]:
fn_bi_boxplot('YEARS_BIRTH_CATEGORY','AMT_INCOME_TOTAL',max_value1_AMT_INCOME_TOTAL,max_value0_AMT_INCOME_TOTAL,'NAME_HOUSING_TYPE')
No description has been provided for this image

Observations

  • Clients with age 60-70 and living in Co-op apartment have very high income range in Payment difficulties category than On-Time Payments
  • Clients with age 20-30 and living in Office apartment have very higher income median in On-Time Payments compared to Payment difficulties category

Analysis of FLAG_OWN_CAR V/S AMT_ANNUITY V/S CODE_GENDER¶

Outlier identification of AMT_ANNUITY with Payment difficulties

In [298]:
max_value1_AMT_ANNUITY = outlier_range(df1,'AMT_ANNUITY')
max_value1_AMT_ANNUITY
Out[298]:
56398.5

Outlier identification of AMT_ANNUITY with On-Time Payments

In [299]:
max_value0_AMT_ANNUITY = outlier_range(df0,'AMT_ANNUITY')
max_value0_AMT_ANNUITY
Out[299]:
62187.75

Client with Payment difficulties

In [300]:
df1.groupby(by = ['FLAG_OWN_CAR','CODE_GENDER']).AMT_ANNUITY.describe().head()
Out[300]:
count mean std min 25% 50% 75% max
FLAG_OWN_CAR CODE_GENDER
N F 11567.00 24910.53 11855.56 2722.50 16350.75 23715.00 31261.50 149211.00
M 5682.00 26258.84 12075.78 3172.50 17517.38 25202.25 32352.75 127507.50
Y F 2603.00 28577.91 13256.17 3559.50 18969.75 26797.50 35961.75 128178.00
M 4973.00 29293.82 13130.94 4374.00 20007.00 27558.00 36459.00 115803.00

Client with On-Time Payments

In [301]:
df0.groupby(by = ['FLAG_OWN_CAR','CODE_GENDER']).AMT_ANNUITY.describe().head()
Out[301]:
count mean std min 25% 50% 75% max
FLAG_OWN_CAR CODE_GENDER
N F 145917.00 25351.26 13579.03 1615.50 15448.50 23283.00 32125.50 230161.50
M 39749.00 26799.80 14295.32 1980.00 16294.50 24750.00 34114.50 225000.00
Y F 42357.00 29830.22 15705.98 1993.50 18409.50 27085.50 38133.00 225000.00
M 54651.00 30200.48 15988.66 2317.50 19125.00 27630.00 38322.00 258025.50
In [302]:
fn_bi_boxplot('FLAG_OWN_CAR','AMT_ANNUITY',max_value1_AMT_ANNUITY,max_value0_AMT_ANNUITY,'CODE_GENDER')
No description has been provided for this image

Observations

  • We don't find any significant observations

Analysis of NAME_INCOME_TYPE V/S AMT_GOODS_PRICE V/S CODE_GENDER¶

Outlier identification of AMT_GOODS_PRICE with Payment difficulties

In [303]:
max_value1_AMT_GOODS_PRICE = outlier_range(df1,'AMT_GOODS_PRICE')
max_value1_AMT_GOODS_PRICE
Out[303]:
1329750.0

Outlier identification of AMT_GOODS_PRICE with On-Time Payments

In [304]:
max_value0_AMT_GOODS_PRICE = outlier_range(df0,'AMT_GOODS_PRICE')
max_value0_AMT_GOODS_PRICE
Out[304]:
1363500.0

Client with Payment difficulties

In [305]:
df1.groupby(by = ['NAME_INCOME_TYPE','CODE_GENDER']).AMT_GOODS_PRICE.describe().head()
Out[305]:
count mean std min 25% 50% 75% max
NAME_INCOME_TYPE CODE_GENDER
Commercial associate F 2966.00 536362.21 353710.65 45000.00 270000.00 450000.00 675000.00 3600000.00
M 2391.00 526387.31 335859.94 45000.00 270000.00 450000.00 675000.00 2961000.00
Maternity leave F 2.00 929250.00 715945.62 423000.00 676125.00 929250.00 1182375.00 1435500.00
Pensioner F 2241.00 495016.26 317042.45 45000.00 229500.00 450000.00 675000.00 2173500.00
M 739.00 486543.08 321104.43 45000.00 229500.00 450000.00 675000.00 1800000.00

Client with On-Time Payments

In [306]:
df0.groupby(by = ['NAME_INCOME_TYPE','CODE_GENDER']).AMT_GOODS_PRICE.describe().head()
Out[306]:
count mean std min 25% 50% 75% max
NAME_INCOME_TYPE CODE_GENDER
Businessman F 3.00 1125000.00 1031079.53 225000.00 562500.00 900000.00 1575000.00 2250000.00
M 7.00 1272857.14 755104.06 495000.00 607500.00 1350000.00 1800000.00 2250000.00
Commercial associate F 41522.00 602242.63 400824.95 45000.00 270000.00 472500.00 837000.00 4050000.00
M 24691.00 620220.21 422445.53 45000.00 270000.00 472500.00 900000.00 4050000.00
Maternity leave F 2.00 562500.00 350017.86 315000.00 438750.00 562500.00 686250.00 810000.00
In [307]:
fn_bi_boxplot('NAME_INCOME_TYPE','AMT_GOODS_PRICE',max_value1_AMT_GOODS_PRICE,max_value0_AMT_GOODS_PRICE,'CODE_GENDER')
No description has been provided for this image

Observations

  • Clients who are Unemployed and Male have a very high price of goods in On-Time Payments than Payment difficulties
  • Clients who are Student and either Male OR Female do their payments On-Time. They are completely missing from Payment difficulties category. Student seems to be an attractive category to give loans to.
  • Clients who are Businessman and either Male OR Female do their payments On-Time. They are completely missing from Payment difficulties category. Businessman seems to be an attractive category to give loans to.

Analysis of NAME_INCOME_TYPE V/S AMT_INCOME_TOTAL V/S CODE_GENDER¶

Outlier identification of AMT_INCOME_TOTAL with Payment difficulties

In [308]:
max_value1_AMT_INCOME_TOTAL = outlier_range(df1,'AMT_INCOME_TOTAL')
max_value1_AMT_INCOME_TOTAL
Out[308]:
337500.0

Outlier identification of AMT_INCOME_TOTAL with On-Time Payments

In [309]:
max_value0_AMT_INCOME_TOTAL = outlier_range(df0,'AMT_INCOME_TOTAL')
max_value0_AMT_INCOME_TOTAL
Out[309]:
337500.0

Client with Payment difficulties

In [310]:
df1.groupby(by = ['NAME_INCOME_TYPE','CODE_GENDER']).AMT_INCOME_TOTAL.describe().head()
Out[310]:
count mean std min 25% 50% 75% max
NAME_INCOME_TYPE CODE_GENDER
Commercial associate F 2968.00 172962.54 94922.97 36000.00 112500.00 157500.00 202500.00 1575000.00
M 2392.00 207145.51 120281.01 45000.00 135000.00 180000.00 225000.00 1890000.00
Maternity leave F 2.00 58500.00 12727.92 49500.00 54000.00 58500.00 63000.00 67500.00
Pensioner F 2243.00 133275.59 71580.63 25650.00 90000.00 117000.00 157500.00 1260000.00
M 739.00 142481.26 76941.74 31500.00 90000.00 135000.00 180000.00 585000.00

Client with On-Time Payments

In [311]:
df0.groupby(by = ['NAME_INCOME_TYPE','CODE_GENDER']).AMT_INCOME_TOTAL.describe().head()
Out[311]:
count mean std min 25% 50% 75% max
NAME_INCOME_TYPE CODE_GENDER
Businessman F 3.00 555000.00 337749.91 225000.00 382500.00 540000.00 720000.00 900000.00
M 7.00 694285.71 738446.73 180000.00 202500.00 450000.00 787500.00 2250000.00
Commercial associate F 41552.00 187223.56 108354.86 26550.00 117000.00 157500.00 225000.00 3600000.00
M 24705.00 232612.57 199602.65 36000.00 157500.00 202500.00 270000.00 18000090.00
Maternity leave F 2.00 112500.00 31819.81 90000.00 101250.00 112500.00 123750.00 135000.00
In [312]:
fn_bi_boxplot('NAME_INCOME_TYPE','AMT_INCOME_TOTAL',max_value1_AMT_INCOME_TOTAL,max_value0_AMT_INCOME_TOTAL,'CODE_GENDER')
No description has been provided for this image

Observations

  • Clients who are Unemployed and Male have a very high income in On-Time Payments than Payment difficulties
  • Clients who are Student and either Male OR Female do their payments On-Time. They are completely missing from Payment difficulties category. Student seems to be an attractive category to give loans to.
  • Clients who are Businessman and either Male OR Female do their payments On-Time. They are completely missing from Payment difficulties category. Businessman seems to be an attractive category to give loans to.
  • Clients who are in Maternity Leave and Female have a very high income in On-Time Payments than Payment difficulties

Analysis of OCCUPATION_TYPE V/S AMT_INCOME_TOTAL V/S CODE_GENDER¶

Outlier identification of AMT_INCOME_TOTAL with Payment difficulties

In [313]:
max_value1_AMT_INCOME_TOTAL = outlier_range(df1,'AMT_INCOME_TOTAL')
max_value1_AMT_INCOME_TOTAL
Out[313]:
337500.0

Outlier identification of AMT_INCOME_TOTAL with On-Time Payments

In [314]:
max_value0_AMT_INCOME_TOTAL = outlier_range(df0,'AMT_INCOME_TOTAL')
max_value0_AMT_INCOME_TOTAL
Out[314]:
337500.0

Client with Payment difficulties

In [315]:
df1.groupby(by = ['OCCUPATION_TYPE','CODE_GENDER']).AMT_INCOME_TOTAL.describe().head()
Out[315]:
count mean std min 25% 50% 75% max
OCCUPATION_TYPE CODE_GENDER
Accountants F 457.00 174840.64 79021.04 54000.00 112500.00 157500.00 202500.00 540000.00
M 17.00 202367.65 98623.66 81000.00 121500.00 180000.00 252000.00 438750.00
Cleaning staff F 410.00 128907.99 70328.23 31500.00 90000.00 112500.00 157500.00 720000.00
M 37.00 119493.24 45786.21 45000.00 90000.00 112500.00 135000.00 247500.00
Cooking staff F 555.00 129298.38 55531.88 32400.00 90000.00 112500.00 157500.00 405000.00

Client with On-Time Payments

In [316]:
df0.groupby(by = ['OCCUPATION_TYPE','CODE_GENDER']).AMT_INCOME_TOTAL.describe().head()
Out[316]:
count mean std min 25% 50% 75% max
OCCUPATION_TYPE CODE_GENDER
Accountants F 9072.00 194324.51 109135.61 27000.00 130500.00 175500.00 225000.00 2214117.00
M 267.00 236490.89 131795.33 54000.00 141750.00 216000.00 270000.00 1035000.00
Cleaning staff F 3904.00 130602.05 63495.05 25650.00 90000.00 112500.00 157500.00 1080000.00
M 302.00 137172.52 64373.09 31500.00 90000.00 132750.00 169875.00 450000.00
Cooking staff F 4830.00 134460.40 60213.56 27000.00 90000.00 121500.00 157500.00 540000.00
In [317]:
fn_bi_boxplot('OCCUPATION_TYPE','AMT_INCOME_TOTAL',max_value1_AMT_INCOME_TOTAL,max_value0_AMT_INCOME_TOTAL,'CODE_GENDER')
No description has been provided for this image

Observations

  • Clients who are Waiters/barment staff and female have less median income in On-Time Payments than Payment difficulties
  • Clients who are Cleaning staff and female have more median income in On-Time Payments than Payment difficulties
  • Clients who are HR Staff and Male have more median income in Payment difficulties than On-Time Payments
  • Clients who are Managers and Male have more median income in On-Time Payments than Payment difficulties

Categorical V/S Categorical variables¶

Analysis of NAME_INCOME_TYPE V/S CODE_GENDER¶

In [318]:
fn_bi_countplot('NAME_INCOME_TYPE','CODE_GENDER')
No description has been provided for this image

Observations

  • Clients who are Working and Male have more Payment difficulties compared to On-Time Payments
  • Clients who are Pensioner and Female have more Payment difficulties compared to On-Time Payments
  • Clients who are Businessman and Students do their payments On-Time though their record count is low

Analysis of NAME_EDUCATION_TYPE V/S CODE_GENDER¶

In [319]:
fn_bi_countplot('NAME_EDUCATION_TYPE','CODE_GENDER')
No description has been provided for this image

Observations

  • Clients who have Secondary/Secondary special education and Male have more Payment difficulties compared to On-Time Payments
  • Clients who have Higher education and Female have more On-Time Payments compared to Payment difficulties

Analysis of NAME_FAMILY_STATUS V/S OCCUPATION_TYPE¶

In [320]:
fn_bi_countplot('OCCUPATION_TYPE','NAME_FAMILY_STATUS')
No description has been provided for this image

Observations

  • Clients who are Single/not married, Married & Civil marriage and are Waiters/barmen staff have more Payment difficulties compared to On-Time Payments
  • Clients who are Single/not married & Married and are Laborers have more Payment difficulties compared to On-Time Payments
  • Clients who are Married and are Drivers have more Payment difficulties compared to On-Time Payments
  • Married and Accountants have better On-Time Payments

Analysis of ORGANIZATION_TYPE V/S FLAG_OWN_CAR¶

In [321]:
fn_bi_countplot('ORGANIZATION_TYPE','FLAG_OWN_CAR')
No description has been provided for this image

Observations

  • Clients who are Self-employedand don't own Car have more Payment difficulties compared to On-Time Payments

Analysis of OCCUPATION_TYPE V/S NAME_CONTRACT_TYPE¶

In [322]:
fn_bi_countplot('OCCUPATION_TYPE','NAME_CONTRACT_TYPE')
No description has been provided for this image

Observations

  • Clients who are Sales staff,Laborers,Drivers and have Cash loans have more Payment difficulties compared to On-Time Payments

Previous Application Data¶


Importing the dataset¶

In [6]:
df_prev = pd.read_csv("Previous_application.csv")
In [324]:
# Checking few records from the dataframe
df_prev.head()
Out[324]:
SK_ID_PREV SK_ID_CURR NAME_CONTRACT_TYPE AMT_ANNUITY AMT_APPLICATION AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START FLAG_LAST_APPL_PER_CONTRACT NFLAG_LAST_APPL_IN_DAY RATE_DOWN_PAYMENT RATE_INTEREST_PRIMARY RATE_INTEREST_PRIVILEGED NAME_CASH_LOAN_PURPOSE NAME_CONTRACT_STATUS DAYS_DECISION NAME_PAYMENT_TYPE CODE_REJECT_REASON NAME_TYPE_SUITE NAME_CLIENT_TYPE NAME_GOODS_CATEGORY NAME_PORTFOLIO NAME_PRODUCT_TYPE CHANNEL_TYPE SELLERPLACE_AREA NAME_SELLER_INDUSTRY CNT_PAYMENT NAME_YIELD_GROUP PRODUCT_COMBINATION DAYS_FIRST_DRAWING DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION NFLAG_INSURED_ON_APPROVAL
0 2030495 271877 Consumer loans 1730.43 17145.00 17145.00 0.00 17145.00 SATURDAY 15 Y 1 0.00 0.18 0.87 XAP Approved -73 Cash through the bank XAP NaN Repeater Mobile POS XNA Country-wide 35 Connectivity 12.00 middle POS mobile with interest 365243.00 -42.00 300.00 -42.00 -37.00 0.00
1 2802425 108129 Cash loans 25188.62 607500.00 679671.00 NaN 607500.00 THURSDAY 11 Y 1 NaN NaN NaN XNA Approved -164 XNA XAP Unaccompanied Repeater XNA Cash x-sell Contact center -1 XNA 36.00 low_action Cash X-Sell: low 365243.00 -134.00 916.00 365243.00 365243.00 1.00
2 2523466 122040 Cash loans 15060.74 112500.00 136444.50 NaN 112500.00 TUESDAY 11 Y 1 NaN NaN NaN XNA Approved -301 Cash through the bank XAP Spouse, partner Repeater XNA Cash x-sell Credit and cash offices -1 XNA 12.00 high Cash X-Sell: high 365243.00 -271.00 59.00 365243.00 365243.00 1.00
3 2819243 176158 Cash loans 47041.33 450000.00 470790.00 NaN 450000.00 MONDAY 7 Y 1 NaN NaN NaN XNA Approved -512 Cash through the bank XAP NaN Repeater XNA Cash x-sell Credit and cash offices -1 XNA 12.00 middle Cash X-Sell: middle 365243.00 -482.00 -152.00 -182.00 -177.00 1.00
4 1784265 202054 Cash loans 31924.40 337500.00 404055.00 NaN 337500.00 THURSDAY 9 Y 1 NaN NaN NaN Repairs Refused -781 Cash through the bank HC NaN Repeater XNA Cash walk-in Credit and cash offices -1 XNA 24.00 high Cash Street: high NaN NaN NaN NaN NaN NaN

Check structure of the data¶

In [325]:
df_prev.info(verbose = True,null_counts = True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1670214 entries, 0 to 1670213
Data columns (total 37 columns):
 #   Column                       Non-Null Count    Dtype  
---  ------                       --------------    -----  
 0   SK_ID_PREV                   1670214 non-null  int64  
 1   SK_ID_CURR                   1670214 non-null  int64  
 2   NAME_CONTRACT_TYPE           1670214 non-null  object 
 3   AMT_ANNUITY                  1297979 non-null  float64
 4   AMT_APPLICATION              1670214 non-null  float64
 5   AMT_CREDIT                   1670213 non-null  float64
 6   AMT_DOWN_PAYMENT             774370 non-null   float64
 7   AMT_GOODS_PRICE              1284699 non-null  float64
 8   WEEKDAY_APPR_PROCESS_START   1670214 non-null  object 
 9   HOUR_APPR_PROCESS_START      1670214 non-null  int64  
 10  FLAG_LAST_APPL_PER_CONTRACT  1670214 non-null  object 
 11  NFLAG_LAST_APPL_IN_DAY       1670214 non-null  int64  
 12  RATE_DOWN_PAYMENT            774370 non-null   float64
 13  RATE_INTEREST_PRIMARY        5951 non-null     float64
 14  RATE_INTEREST_PRIVILEGED     5951 non-null     float64
 15  NAME_CASH_LOAN_PURPOSE       1670214 non-null  object 
 16  NAME_CONTRACT_STATUS         1670214 non-null  object 
 17  DAYS_DECISION                1670214 non-null  int64  
 18  NAME_PAYMENT_TYPE            1670214 non-null  object 
 19  CODE_REJECT_REASON           1670214 non-null  object 
 20  NAME_TYPE_SUITE              849809 non-null   object 
 21  NAME_CLIENT_TYPE             1670214 non-null  object 
 22  NAME_GOODS_CATEGORY          1670214 non-null  object 
 23  NAME_PORTFOLIO               1670214 non-null  object 
 24  NAME_PRODUCT_TYPE            1670214 non-null  object 
 25  CHANNEL_TYPE                 1670214 non-null  object 
 26  SELLERPLACE_AREA             1670214 non-null  int64  
 27  NAME_SELLER_INDUSTRY         1670214 non-null  object 
 28  CNT_PAYMENT                  1297984 non-null  float64
 29  NAME_YIELD_GROUP             1670214 non-null  object 
 30  PRODUCT_COMBINATION          1669868 non-null  object 
 31  DAYS_FIRST_DRAWING           997149 non-null   float64
 32  DAYS_FIRST_DUE               997149 non-null   float64
 33  DAYS_LAST_DUE_1ST_VERSION    997149 non-null   float64
 34  DAYS_LAST_DUE                997149 non-null   float64
 35  DAYS_TERMINATION             997149 non-null   float64
 36  NFLAG_INSURED_ON_APPROVAL    997149 non-null   float64
dtypes: float64(15), int64(6), object(16)
memory usage: 471.5+ MB

We do not see any columns with Nullable values

In [326]:
df_prev.shape
Out[326]:
(1670214, 37)

There are ~1.67 million rows and 37 columns

Get statistical summary for numerical variables¶

In [327]:
df_prev.describe()
Out[327]:
SK_ID_PREV SK_ID_CURR AMT_ANNUITY AMT_APPLICATION AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE HOUR_APPR_PROCESS_START NFLAG_LAST_APPL_IN_DAY RATE_DOWN_PAYMENT RATE_INTEREST_PRIMARY RATE_INTEREST_PRIVILEGED DAYS_DECISION SELLERPLACE_AREA CNT_PAYMENT DAYS_FIRST_DRAWING DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION NFLAG_INSURED_ON_APPROVAL
count 1670214.00 1670214.00 1297979.00 1670214.00 1670213.00 774370.00 1284699.00 1670214.00 1670214.00 774370.00 5951.00 5951.00 1670214.00 1670214.00 1297984.00 997149.00 997149.00 997149.00 997149.00 997149.00 997149.00
mean 1923089.14 278357.17 15955.12 175233.86 196114.02 6697.40 227847.28 12.48 1.00 0.08 0.19 0.77 -880.68 313.95 16.05 342209.86 13826.27 33767.77 76582.40 81992.34 0.33
std 532597.96 102814.82 14782.14 292779.76 318574.62 20921.50 315396.56 3.33 0.06 0.11 0.09 0.10 779.10 7127.44 14.57 88916.12 72444.87 106857.03 149647.42 153303.52 0.47
min 1000001.00 100001.00 0.00 0.00 0.00 -0.90 0.00 0.00 0.00 -0.00 0.03 0.37 -2922.00 -1.00 0.00 -2922.00 -2892.00 -2801.00 -2889.00 -2874.00 0.00
25% 1461857.25 189329.00 6321.78 18720.00 24160.50 0.00 50841.00 10.00 1.00 0.00 0.16 0.72 -1300.00 -1.00 6.00 365243.00 -1628.00 -1242.00 -1314.00 -1270.00 0.00
50% 1923110.50 278714.50 11250.00 71046.00 80541.00 1638.00 112320.00 12.00 1.00 0.05 0.19 0.84 -581.00 3.00 12.00 365243.00 -831.00 -361.00 -537.00 -499.00 0.00
75% 2384279.75 367514.00 20658.42 180360.00 216418.50 7740.00 234000.00 15.00 1.00 0.11 0.19 0.85 -280.00 82.00 24.00 365243.00 -411.00 129.00 -74.00 -44.00 1.00
max 2845382.00 456255.00 418058.15 6905160.00 6905160.00 3060045.00 6905160.00 23.00 1.00 1.00 1.00 1.00 -1.00 4000000.00 84.00 365243.00 365243.00 365243.00 365243.00 365243.00 1.00

Analyzing categorical variables¶

In [328]:
df_prev.select_dtypes(include = "object").columns
Out[328]:
Index(['NAME_CONTRACT_TYPE', 'WEEKDAY_APPR_PROCESS_START',
       'FLAG_LAST_APPL_PER_CONTRACT', 'NAME_CASH_LOAN_PURPOSE',
       'NAME_CONTRACT_STATUS', 'NAME_PAYMENT_TYPE', 'CODE_REJECT_REASON',
       'NAME_TYPE_SUITE', 'NAME_CLIENT_TYPE', 'NAME_GOODS_CATEGORY',
       'NAME_PORTFOLIO', 'NAME_PRODUCT_TYPE', 'CHANNEL_TYPE',
       'NAME_SELLER_INDUSTRY', 'NAME_YIELD_GROUP', 'PRODUCT_COMBINATION'],
      dtype='object')
In [329]:
# Checking number of categorical variables
len(df_prev.select_dtypes(include = "object").columns)
Out[329]:
16

There are 16 categorical variables

Analyzing numerical variables¶

In [330]:
df_prev.select_dtypes(include=["int64","float64"]).columns
Out[330]:
Index(['SK_ID_PREV', 'SK_ID_CURR', 'AMT_ANNUITY', 'AMT_APPLICATION',
       'AMT_CREDIT', 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE',
       'HOUR_APPR_PROCESS_START', 'NFLAG_LAST_APPL_IN_DAY',
       'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
       'RATE_INTEREST_PRIVILEGED', 'DAYS_DECISION', 'SELLERPLACE_AREA',
       'CNT_PAYMENT', 'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE',
       'DAYS_LAST_DUE_1ST_VERSION', 'DAYS_LAST_DUE', 'DAYS_TERMINATION',
       'NFLAG_INSURED_ON_APPROVAL'],
      dtype='object')
In [331]:
# Checking number of categorical variables
len(df_prev.select_dtypes(include=["int64","float64"]).columns)
Out[331]:
21

There are 21 numerical variables

In [332]:
df_prev.select_dtypes(include=["int64","float64"]).head()
Out[332]:
SK_ID_PREV SK_ID_CURR AMT_ANNUITY AMT_APPLICATION AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE HOUR_APPR_PROCESS_START NFLAG_LAST_APPL_IN_DAY RATE_DOWN_PAYMENT RATE_INTEREST_PRIMARY RATE_INTEREST_PRIVILEGED DAYS_DECISION SELLERPLACE_AREA CNT_PAYMENT DAYS_FIRST_DRAWING DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION NFLAG_INSURED_ON_APPROVAL
0 2030495 271877 1730.43 17145.00 17145.00 0.00 17145.00 15 1 0.00 0.18 0.87 -73 35 12.00 365243.00 -42.00 300.00 -42.00 -37.00 0.00
1 2802425 108129 25188.62 607500.00 679671.00 NaN 607500.00 11 1 NaN NaN NaN -164 -1 36.00 365243.00 -134.00 916.00 365243.00 365243.00 1.00
2 2523466 122040 15060.74 112500.00 136444.50 NaN 112500.00 11 1 NaN NaN NaN -301 -1 12.00 365243.00 -271.00 59.00 365243.00 365243.00 1.00
3 2819243 176158 47041.33 450000.00 470790.00 NaN 450000.00 7 1 NaN NaN NaN -512 -1 12.00 365243.00 -482.00 -152.00 -182.00 -177.00 1.00
4 1784265 202054 31924.40 337500.00 404055.00 NaN 337500.00 9 1 NaN NaN NaN -781 -1 24.00 NaN NaN NaN NaN NaN NaN

Dealing with incorrect data types¶

Check if we have any column with incorrect data type

In [333]:
df_prev.dtypes
Out[333]:
SK_ID_PREV                       int64
SK_ID_CURR                       int64
NAME_CONTRACT_TYPE              object
AMT_ANNUITY                    float64
AMT_APPLICATION                float64
AMT_CREDIT                     float64
AMT_DOWN_PAYMENT               float64
AMT_GOODS_PRICE                float64
WEEKDAY_APPR_PROCESS_START      object
HOUR_APPR_PROCESS_START          int64
FLAG_LAST_APPL_PER_CONTRACT     object
NFLAG_LAST_APPL_IN_DAY           int64
RATE_DOWN_PAYMENT              float64
RATE_INTEREST_PRIMARY          float64
RATE_INTEREST_PRIVILEGED       float64
NAME_CASH_LOAN_PURPOSE          object
NAME_CONTRACT_STATUS            object
DAYS_DECISION                    int64
NAME_PAYMENT_TYPE               object
CODE_REJECT_REASON              object
NAME_TYPE_SUITE                 object
NAME_CLIENT_TYPE                object
NAME_GOODS_CATEGORY             object
NAME_PORTFOLIO                  object
NAME_PRODUCT_TYPE               object
CHANNEL_TYPE                    object
SELLERPLACE_AREA                 int64
NAME_SELLER_INDUSTRY            object
CNT_PAYMENT                    float64
NAME_YIELD_GROUP                object
PRODUCT_COMBINATION             object
DAYS_FIRST_DRAWING             float64
DAYS_FIRST_DUE                 float64
DAYS_LAST_DUE_1ST_VERSION      float64
DAYS_LAST_DUE                  float64
DAYS_TERMINATION               float64
NFLAG_INSURED_ON_APPROVAL      float64
dtype: object
In [334]:
df_prev.head()
Out[334]:
SK_ID_PREV SK_ID_CURR NAME_CONTRACT_TYPE AMT_ANNUITY AMT_APPLICATION AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START FLAG_LAST_APPL_PER_CONTRACT NFLAG_LAST_APPL_IN_DAY RATE_DOWN_PAYMENT RATE_INTEREST_PRIMARY RATE_INTEREST_PRIVILEGED NAME_CASH_LOAN_PURPOSE NAME_CONTRACT_STATUS DAYS_DECISION NAME_PAYMENT_TYPE CODE_REJECT_REASON NAME_TYPE_SUITE NAME_CLIENT_TYPE NAME_GOODS_CATEGORY NAME_PORTFOLIO NAME_PRODUCT_TYPE CHANNEL_TYPE SELLERPLACE_AREA NAME_SELLER_INDUSTRY CNT_PAYMENT NAME_YIELD_GROUP PRODUCT_COMBINATION DAYS_FIRST_DRAWING DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION NFLAG_INSURED_ON_APPROVAL
0 2030495 271877 Consumer loans 1730.43 17145.00 17145.00 0.00 17145.00 SATURDAY 15 Y 1 0.00 0.18 0.87 XAP Approved -73 Cash through the bank XAP NaN Repeater Mobile POS XNA Country-wide 35 Connectivity 12.00 middle POS mobile with interest 365243.00 -42.00 300.00 -42.00 -37.00 0.00
1 2802425 108129 Cash loans 25188.62 607500.00 679671.00 NaN 607500.00 THURSDAY 11 Y 1 NaN NaN NaN XNA Approved -164 XNA XAP Unaccompanied Repeater XNA Cash x-sell Contact center -1 XNA 36.00 low_action Cash X-Sell: low 365243.00 -134.00 916.00 365243.00 365243.00 1.00
2 2523466 122040 Cash loans 15060.74 112500.00 136444.50 NaN 112500.00 TUESDAY 11 Y 1 NaN NaN NaN XNA Approved -301 Cash through the bank XAP Spouse, partner Repeater XNA Cash x-sell Credit and cash offices -1 XNA 12.00 high Cash X-Sell: high 365243.00 -271.00 59.00 365243.00 365243.00 1.00
3 2819243 176158 Cash loans 47041.33 450000.00 470790.00 NaN 450000.00 MONDAY 7 Y 1 NaN NaN NaN XNA Approved -512 Cash through the bank XAP NaN Repeater XNA Cash x-sell Credit and cash offices -1 XNA 12.00 middle Cash X-Sell: middle 365243.00 -482.00 -152.00 -182.00 -177.00 1.00
4 1784265 202054 Cash loans 31924.40 337500.00 404055.00 NaN 337500.00 THURSDAY 9 Y 1 NaN NaN NaN Repairs Refused -781 Cash through the bank HC NaN Repeater XNA Cash walk-in Credit and cash offices -1 XNA 24.00 high Cash Street: high NaN NaN NaN NaN NaN NaN

Looking at the data and their corresponding data types, we can conclude that No Data Type changes are required

Dealing with missing values¶

Check if we have any null values in the dataset

In [335]:
df_prev.isnull().values.any()
Out[335]:
True

Get Total number of null values in the dataset

In [336]:
df_prev.isnull().values.sum()
Out[336]:
11109336

Getting the list of column(s) which have null values

In [337]:
df_prev.columns[df_prev.isnull().any()]
Out[337]:
Index(['AMT_ANNUITY', 'AMT_CREDIT', 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE',
       'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
       'RATE_INTEREST_PRIVILEGED', 'NAME_TYPE_SUITE', 'CNT_PAYMENT',
       'PRODUCT_COMBINATION', 'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE',
       'DAYS_LAST_DUE_1ST_VERSION', 'DAYS_LAST_DUE', 'DAYS_TERMINATION',
       'NFLAG_INSURED_ON_APPROVAL'],
      dtype='object')
In [338]:
len(df_prev.columns[df_prev.isnull().any()])
Out[338]:
16

There are totally 16 columns having one or more NULL values in the data

Computing count and percentage of missing values¶

In [339]:
null_count = df_prev.isnull().sum()
null_percentage = round((df_prev.isnull().sum()/df_prev.shape[0])*100, 2)
In [340]:
null_df = pd.DataFrame({'column_name' : df_prev.columns,'null_count' : null_count,'null_percentage': null_percentage})
null_df.reset_index(drop = True, inplace = True)
In [341]:
null_df.sort_values(by = 'null_percentage', ascending = False)
Out[341]:
column_name null_count null_percentage
14 RATE_INTEREST_PRIVILEGED 1664263 99.64
13 RATE_INTEREST_PRIMARY 1664263 99.64
12 RATE_DOWN_PAYMENT 895844 53.64
6 AMT_DOWN_PAYMENT 895844 53.64
20 NAME_TYPE_SUITE 820405 49.12
36 NFLAG_INSURED_ON_APPROVAL 673065 40.30
31 DAYS_FIRST_DRAWING 673065 40.30
32 DAYS_FIRST_DUE 673065 40.30
33 DAYS_LAST_DUE_1ST_VERSION 673065 40.30
34 DAYS_LAST_DUE 673065 40.30
35 DAYS_TERMINATION 673065 40.30
7 AMT_GOODS_PRICE 385515 23.08
3 AMT_ANNUITY 372235 22.29
28 CNT_PAYMENT 372230 22.29
30 PRODUCT_COMBINATION 346 0.02
25 CHANNEL_TYPE 0 0.00
24 NAME_PRODUCT_TYPE 0 0.00
29 NAME_YIELD_GROUP 0 0.00
26 SELLERPLACE_AREA 0 0.00
27 NAME_SELLER_INDUSTRY 0 0.00
22 NAME_GOODS_CATEGORY 0 0.00
23 NAME_PORTFOLIO 0 0.00
0 SK_ID_PREV 0 0.00
21 NAME_CLIENT_TYPE 0 0.00
19 CODE_REJECT_REASON 0 0.00
1 SK_ID_CURR 0 0.00
17 DAYS_DECISION 0 0.00
16 NAME_CONTRACT_STATUS 0 0.00
15 NAME_CASH_LOAN_PURPOSE 0 0.00
11 NFLAG_LAST_APPL_IN_DAY 0 0.00
10 FLAG_LAST_APPL_PER_CONTRACT 0 0.00
9 HOUR_APPR_PROCESS_START 0 0.00
8 WEEKDAY_APPR_PROCESS_START 0 0.00
5 AMT_CREDIT 1 0.00
4 AMT_APPLICATION 0 0.00
2 NAME_CONTRACT_TYPE 0 0.00
18 NAME_PAYMENT_TYPE 0 0.00

Removing columns with NULL values > 40%¶

Getting list of columns with NULL values > 40% into a list. We will be removing these columns from the dataframe as there are too many missing values.

In [342]:
columns_to_be_deleted = null_df[null_df['null_percentage'] > 40].column_name.to_list()
In [343]:
columns_to_be_deleted
Out[343]:
['AMT_DOWN_PAYMENT',
 'RATE_DOWN_PAYMENT',
 'RATE_INTEREST_PRIMARY',
 'RATE_INTEREST_PRIVILEGED',
 'NAME_TYPE_SUITE',
 'DAYS_FIRST_DRAWING',
 'DAYS_FIRST_DUE',
 'DAYS_LAST_DUE_1ST_VERSION',
 'DAYS_LAST_DUE',
 'DAYS_TERMINATION',
 'NFLAG_INSURED_ON_APPROVAL']
In [344]:
len(columns_to_be_deleted)
Out[344]:
11

There are totally 11 columns to be removed. Deleting them from previous application dataframe df_prev

In [345]:
df_prev.drop(columns = columns_to_be_deleted, inplace = True)

Checking column count post removal. Only 26 columns should be left

In [346]:
df_prev.shape
Out[346]:
(1670214, 26)

Checking columns with NULL values < 40%¶

Creating dataframe null_df_under40 with missing column percentages under 40%

In [347]:
null_df_under40 = null_df[null_df['null_percentage'] < 40]
In [348]:
null_df_under40.sort_values(by = 'null_percentage', ascending = False)
Out[348]:
column_name null_count null_percentage
7 AMT_GOODS_PRICE 385515 23.08
3 AMT_ANNUITY 372235 22.29
28 CNT_PAYMENT 372230 22.29
30 PRODUCT_COMBINATION 346 0.02
18 NAME_PAYMENT_TYPE 0 0.00
29 NAME_YIELD_GROUP 0 0.00
27 NAME_SELLER_INDUSTRY 0 0.00
26 SELLERPLACE_AREA 0 0.00
25 CHANNEL_TYPE 0 0.00
24 NAME_PRODUCT_TYPE 0 0.00
23 NAME_PORTFOLIO 0 0.00
22 NAME_GOODS_CATEGORY 0 0.00
21 NAME_CLIENT_TYPE 0 0.00
19 CODE_REJECT_REASON 0 0.00
0 SK_ID_PREV 0 0.00
1 SK_ID_CURR 0 0.00
16 NAME_CONTRACT_STATUS 0 0.00
15 NAME_CASH_LOAN_PURPOSE 0 0.00
11 NFLAG_LAST_APPL_IN_DAY 0 0.00
10 FLAG_LAST_APPL_PER_CONTRACT 0 0.00
9 HOUR_APPR_PROCESS_START 0 0.00
8 WEEKDAY_APPR_PROCESS_START 0 0.00
5 AMT_CREDIT 1 0.00
4 AMT_APPLICATION 0 0.00
2 NAME_CONTRACT_TYPE 0 0.00
17 DAYS_DECISION 0 0.00

Analysis of AMT_GOODS_PRICE column¶

  • nullable values = 23.08%
In [349]:
df_prev.AMT_GOODS_PRICE.value_counts().head()
Out[349]:
45000.00     47831
225000.00    43549
135000.00    40666
450000.00    38926
90000.00     29367
Name: AMT_GOODS_PRICE, dtype: int64
In [350]:
sns.boxplot(df_prev.AMT_GOODS_PRICE)
plt.show()
No description has been provided for this image

Getting percentile values for AMT_GOODS_PRICE

In [351]:
df_prev.AMT_GOODS_PRICE.quantile(q = [0.25,0.5,0.75,1])
Out[351]:
0.25     50841.00
0.50    112320.00
0.75    234000.00
1.00   6905160.00
Name: AMT_GOODS_PRICE, dtype: float64

Most recurring value in AMT_GOODS_PRICE

In [352]:
df_prev.AMT_GOODS_PRICE.mode()[0]
Out[352]:
45000.0

Checking the average value of AMT_GOODS_PRICE

In [353]:
df_prev.AMT_GOODS_PRICE.mean()
Out[353]:
227847.27928334344

Observations

  • Looking at the boxplot, median is 112320.00
  • Most recurring value is 45000.0
  • Mean value is 227847.27928334344
  • Since missing percentage value is higher (23.08%), it would be better to leave the data as it is and not perform imputations

Analysis of AMT_ANNUITY column¶

  • nullable values = 22.29%
In [354]:
df_prev.AMT_ANNUITY.value_counts().head()
Out[354]:
2250.00     31865
11250.00    13974
6750.00     13442
9000.00     12496
22500.00    11903
Name: AMT_ANNUITY, dtype: int64
In [355]:
sns.boxplot(df_prev.AMT_ANNUITY)
plt.show()
No description has been provided for this image

Getting percentile values for AMT_ANNUITY

In [356]:
df_prev.AMT_ANNUITY.quantile(q = [0.25,0.5,0.75,1])
Out[356]:
0.25     6321.78
0.50    11250.00
0.75    20658.42
1.00   418058.15
Name: AMT_ANNUITY, dtype: float64

Most recurring value in AMT_ANNUITY

In [357]:
df_prev.AMT_ANNUITY.mode()[0]
Out[357]:
2250.0

Checking the average value of AMT_ANNUITY

In [358]:
df_prev.AMT_ANNUITY.mean()
Out[358]:
15955.120659450406

Observations

  • Looking at the boxplot, median is 11250.00
  • Most recurring value is 2250.0
  • Mean value is 15955.120659450406
  • Since missing percentage value is higher (22.29%), it would be better to leave the data as it is and not perform imputations

Analysis of CNT_PAYMENT column¶

  • nullable values = 22.29%
In [359]:
df_prev.CNT_PAYMENT.value_counts().head()
Out[359]:
12.00    323049
6.00     190461
0.00     144985
10.00    141851
24.00    137764
Name: CNT_PAYMENT, dtype: int64
In [360]:
sns.boxplot(df_prev.CNT_PAYMENT)
plt.show()
No description has been provided for this image

Getting percentile values for CNT_PAYMENT

In [361]:
df_prev.CNT_PAYMENT.quantile(q = [0.25,0.5,0.75,1])
Out[361]:
0.25    6.00
0.50   12.00
0.75   24.00
1.00   84.00
Name: CNT_PAYMENT, dtype: float64

Most recurring value in CNT_PAYMENT

In [362]:
df_prev.CNT_PAYMENT.mode()[0]
Out[362]:
12.0

Checking the average value of CNT_PAYMENT

In [363]:
df_prev.CNT_PAYMENT.mean()
Out[363]:
16.0540815603274

Observations

  • Looking at the boxplot, median is 12.00
  • Most recurring value is 12.0
  • Mean value is 16.0540815603274
  • Though median & mode are same, since missing percentage value is higher (22.29%), it would be better to leave the data as it is and not perform imputations

Checking columns with NULL values > 0% and < 1%¶

Creating dataframe null_df_under1 with missing column percentages values > 0% and < 1%

In [364]:
null_df_under1 = null_df[(null_df['null_percentage'] > 0) & (null_df['null_percentage'] < 1)]
In [365]:
null_df_under1.sort_values(by = 'null_percentage', ascending = False)
Out[365]:
column_name null_count null_percentage
30 PRODUCT_COMBINATION 346 0.02

Analysis of PRODUCT_COMBINATION column¶

  • nullable values = 0.02%
In [366]:
df_prev['PRODUCT_COMBINATION'].value_counts().head()
Out[366]:
Cash                           285990
POS household with interest    263622
POS mobile with interest       220670
Cash X-Sell: middle            143883
Cash X-Sell: low               130248
Name: PRODUCT_COMBINATION, dtype: int64
In [367]:
plt.figure(figsize = (10,5))
sns.countplot(data = df_prev, x = "PRODUCT_COMBINATION")
plt.xticks(rotation = 90)
plt.show()
No description has been provided for this image

Observations

  • Looking at the plot, Cash Product_Combination has the highest number of loan applicants
  • We can go ahead and impute Cash in the dataframe

Dealing with incorrect/unknown data values¶

Analysis of NAME_CASH_LOAN_PURPOSE column¶

Checking range of values

In [368]:
df_prev['NAME_CASH_LOAN_PURPOSE'].value_counts(normalize = True).head()
Out[368]:
XAP            0.55
XNA            0.41
Repairs        0.01
Other          0.01
Urgent needs   0.01
Name: NAME_CASH_LOAN_PURPOSE, dtype: float64

Observations

  • Though XAP and XNA don't provide any understanding of Loan purpose, they indicate Not Available and Not Applicable
  • Since this makes up 99% of the data, we will leave it as is

Analysis of DAYS_DECISION column¶

In [369]:
df_prev['DAYS_DECISION'].value_counts()
Out[369]:
-245     2444
-238     2390
-210     2375
-273     2350
-196     2315
         ... 
-2882     163
-2922     162
-2921     158
-2902     148
-1          2
Name: DAYS_DECISION, Length: 2922, dtype: int64

There are 2922 unique records all of which seem to be having negative values

In [370]:
df_prev['DAYS_DECISION'].unique()
Out[370]:
array([  -73,  -164,  -301, ..., -1967, -2389,    -1])
In [371]:
df_prev['DAYS_DECISION'].nunique()
Out[371]:
2922

Converting DAYS_DECISION to positive days

In [372]:
df_prev['DAYS_DECISION'] = df_prev['DAYS_DECISION'].apply(lambda x: -x if x < 0 else x)
In [373]:
df_prev['DAYS_DECISION'].value_counts()
Out[373]:
245     2444
238     2390
210     2375
273     2350
196     2315
        ... 
2882     163
2922     162
2921     158
2902     148
1          2
Name: DAYS_DECISION, Length: 2922, dtype: int64

All Days in DAYS_DECISION have positive values

Analysis of NAME_PAYMENT_TYPE column¶

Checking range of values

In [374]:
df_prev['NAME_PAYMENT_TYPE'].value_counts(normalize = True)
Out[374]:
Cash through the bank                       0.62
XNA                                         0.38
Non-cash from your account                  0.00
Cashless from the account of the employer   0.00
Name: NAME_PAYMENT_TYPE, dtype: float64

Observations

  • Though XNA doesn't provide any understanding of Payment TYpe, it indicates Not Applicable
  • Since this makes up 38% of the data, we will leave it as is

Analysis of CODE_REJECT_REASON column¶

Checking range of values

In [375]:
df_prev['CODE_REJECT_REASON'].value_counts(normalize = True)
Out[375]:
XAP      0.81
HC       0.10
LIMIT    0.03
SCO      0.02
CLIENT   0.02
SCOFR    0.01
XNA      0.00
VERIF    0.00
SYSTEM   0.00
Name: CODE_REJECT_REASON, dtype: float64

Observations

  • Though XAP doesn't provide any understanding of Payment Type, it indicates Not Available
  • Since this makes up 81% of the data, we will leave it as is

Analysis of NAME_CLIENT_TYPE column¶

Checking range of values

In [376]:
df_prev['NAME_CLIENT_TYPE'].value_counts(normalize=True)
Out[376]:
Repeater    0.74
New         0.18
Refreshed   0.08
XNA         0.00
Name: NAME_CLIENT_TYPE, dtype: float64

XNA value may indicate that the value was not provided by the loan applicant or missed by the loan officer verifying the application

In [377]:
df_prev[df_prev['NAME_CLIENT_TYPE'] == 'XNA'].head()
Out[377]:
SK_ID_PREV SK_ID_CURR NAME_CONTRACT_TYPE AMT_ANNUITY AMT_APPLICATION AMT_CREDIT AMT_GOODS_PRICE WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START FLAG_LAST_APPL_PER_CONTRACT NFLAG_LAST_APPL_IN_DAY NAME_CASH_LOAN_PURPOSE NAME_CONTRACT_STATUS DAYS_DECISION NAME_PAYMENT_TYPE CODE_REJECT_REASON NAME_CLIENT_TYPE NAME_GOODS_CATEGORY NAME_PORTFOLIO NAME_PRODUCT_TYPE CHANNEL_TYPE SELLERPLACE_AREA NAME_SELLER_INDUSTRY CNT_PAYMENT NAME_YIELD_GROUP PRODUCT_COMBINATION
694 1044207 241230 Cash loans NaN 0.00 0.00 NaN SUNDAY 12 Y 1 XNA Canceled 53 XNA XAP XNA XNA XNA XNA Credit and cash offices -1 XNA NaN XNA Cash
3419 1167657 366138 Cash loans 55400.89 1354500.00 1451047.50 1354500.00 MONDAY 17 Y 1 XNA Refused 231 Cash through the bank HC XNA XNA Cash x-sell Credit and cash offices -1 XNA 36.00 low_normal Cash X-Sell: low
3594 2609511 334664 Consumer loans 2223.99 20659.50 23112.00 20659.50 MONDAY 11 Y 1 XAP Approved 245 Cash through the bank XAP XNA Computers POS XNA Country-wide 50 Consumer electronics 12.00 low_action POS household without interest
5819 2092311 281867 Cash loans NaN 0.00 0.00 NaN MONDAY 13 Y 1 XNA Canceled 242 XNA XAP XNA XNA XNA XNA Credit and cash offices -1 XNA NaN XNA Cash
5990 2680830 335491 Cash loans 32048.46 877500.00 1016496.00 877500.00 MONDAY 9 Y 1 Repairs Refused 195 Cash through the bank HC XNA XNA Cash walk-in Credit and cash offices -1 XNA 48.00 low_normal Cash Street: low

As data looks valid, we will go ahead and check for an imputation method.

  • Repeater applicants make up 74% of applicants
  • And so, we will go ahead and impute NAME_CLIENT_TYPE with 'Repeater'
In [378]:
df_prev['NAME_CLIENT_TYPE'] = df_prev['NAME_CLIENT_TYPE'].apply(lambda x: 'Repeater' if x == 'XNA' else x)

Checking if XNA is removed

In [379]:
df_prev['NAME_CLIENT_TYPE'].value_counts(normalize=True)
Out[379]:
Repeater    0.74
New         0.18
Refreshed   0.08
Name: NAME_CLIENT_TYPE, dtype: float64

There are other columns with XNA but we will leave them as is

Dealing with outliers for numerical columns¶

Analysis of AMT_ANNUITY column¶

In [380]:
df_prev['AMT_ANNUITY'].value_counts().sort_values(ascending = False).head()
Out[380]:
2250.00     31865
11250.00    13974
6750.00     13442
9000.00     12496
22500.00    11903
Name: AMT_ANNUITY, dtype: int64
In [381]:
(df_prev['AMT_ANNUITY'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()
Out[381]:
2250.00    2.45
11250.00   1.08
6750.00    1.04
9000.00    0.96
22500.00   0.92
Name: AMT_ANNUITY, dtype: float64
In [382]:
df_prev.AMT_ANNUITY.quantile(q = [0.25,0.5,0.75,0.99,1])
Out[382]:
0.25     6321.78
0.50    11250.00
0.75    20658.42
0.99    69685.79
1.00   418058.15
Name: AMT_ANNUITY, dtype: float64
In [383]:
fn_dist_box(df_prev,'AMT_ANNUITY')
No description has been provided for this image

Calculating IQR (Inter Quartile range)

In [384]:
Q1 = df_prev['AMT_ANNUITY'].quantile(0.25)
Q3 = df_prev['AMT_ANNUITY'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
14336.64

Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers

In [385]:
Min_value = (Q1 - 1.5 * IQR)
Max_value = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value))
print("Max value after which outlier exist: {}".format(Max_value))
Min value before which outlier exist: -15183.18
Max value after which outlier exist: 42163.38

Observations

  • AMT_ANNUITY values above 42163.38 are outliers

Analysis of AMT_APPLICATION column¶

In [386]:
df_prev['AMT_APPLICATION'].value_counts().sort_values(ascending = False).head()
Out[386]:
0.00         392402
45000.00      47831
225000.00     43543
135000.00     40678
450000.00     38905
Name: AMT_APPLICATION, dtype: int64
In [387]:
(df_prev['AMT_APPLICATION'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()
Out[387]:
0.00        23.49
45000.00     2.86
225000.00    2.61
135000.00    2.44
450000.00    2.33
Name: AMT_APPLICATION, dtype: float64
In [388]:
df_prev.AMT_APPLICATION.quantile(q = [0.25,0.5,0.75,0.99,1])
Out[388]:
0.25     18720.00
0.50     71046.00
0.75    180360.00
0.99   1350000.00
1.00   6905160.00
Name: AMT_APPLICATION, dtype: float64
In [389]:
fn_dist_box(df_prev,'AMT_APPLICATION')
No description has been provided for this image

Calculating IQR (Inter Quartile range)

In [390]:
Q1 = df_prev['AMT_APPLICATION'].quantile(0.25)
Q3 = df_prev['AMT_APPLICATION'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
161640.0

Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers

In [391]:
Min_value = (Q1 - 1.5 * IQR)
Max_value = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value))
print("Max value after which outlier exist: {}".format(Max_value))
Min value before which outlier exist: -223740.0
Max value after which outlier exist: 422820.0

Observations

  • AMT_APPLICATION values above 422820.0 are outliers

Analysis of AMT_CREDIT column¶

In [392]:
df_prev['AMT_CREDIT'].value_counts().sort_values(ascending = False).head()
Out[392]:
0.00         336768
45000.00      35051
225000.00     21094
450000.00     19954
135000.00     18720
Name: AMT_CREDIT, dtype: int64
In [393]:
(df_prev['AMT_CREDIT'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()
Out[393]:
0.00        20.16
45000.00     2.10
225000.00    1.26
450000.00    1.19
135000.00    1.12
Name: AMT_CREDIT, dtype: float64
In [394]:
df_prev.AMT_CREDIT.quantile(q = [0.25,0.5,0.75,0.99,1])
Out[394]:
0.25     24160.50
0.50     80541.00
0.75    216418.50
0.99   1515377.70
1.00   6905160.00
Name: AMT_CREDIT, dtype: float64
In [395]:
fn_dist_box(df_prev,'AMT_CREDIT')
No description has been provided for this image

Calculating IQR (Inter Quartile range)

In [396]:
Q1 = df_prev['AMT_CREDIT'].quantile(0.25)
Q3 = df_prev['AMT_CREDIT'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
192258.0

Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers

In [397]:
Min_value = (Q1 - 1.5 * IQR)
Max_value = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value))
print("Max value after which outlier exist: {}".format(Max_value))
Min value before which outlier exist: -264226.5
Max value after which outlier exist: 504805.5

Observations

  • AMT_CREDIT values above 504805.5 are outliers

Merge datasets df and df_prev into df_merge¶

In [398]:
df_merge = df.merge(df_prev, left_on='SK_ID_CURR', right_on='SK_ID_CURR', how='inner')
In [399]:
df.shape
Out[399]:
(307511, 81)
In [400]:
df_prev.shape
Out[400]:
(1670214, 26)
In [401]:
df_merge.shape
Out[401]:
(1413701, 106)

Check structure of the data¶

In [402]:
df_merge.info(verbose = True,null_counts = True)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1413701 entries, 0 to 1413700
Data columns (total 106 columns):
 #    Column                        Non-Null Count    Dtype   
---   ------                        --------------    -----   
 0    SK_ID_CURR                    1413701 non-null  int64   
 1    TARGET                        1413701 non-null  int64   
 2    NAME_CONTRACT_TYPE_x          1413701 non-null  object  
 3    CODE_GENDER                   1413701 non-null  object  
 4    FLAG_OWN_CAR                  1413701 non-null  object  
 5    FLAG_OWN_REALTY               1413701 non-null  object  
 6    CNT_CHILDREN                  1413701 non-null  int64   
 7    AMT_INCOME_TOTAL              1413701 non-null  float64 
 8    AMT_CREDIT_x                  1413701 non-null  float64 
 9    AMT_ANNUITY_x                 1413608 non-null  float64 
 10   AMT_GOODS_PRICE_x             1412493 non-null  float64 
 11   NAME_TYPE_SUITE               1410175 non-null  object  
 12   NAME_INCOME_TYPE              1413701 non-null  object  
 13   NAME_EDUCATION_TYPE           1413701 non-null  object  
 14   NAME_FAMILY_STATUS            1413701 non-null  object  
 15   NAME_HOUSING_TYPE             1413701 non-null  object  
 16   REGION_POPULATION_RELATIVE    1413701 non-null  float64 
 17   DAYS_BIRTH                    1413701 non-null  int64   
 18   DAYS_EMPLOYED                 1413701 non-null  int64   
 19   DAYS_REGISTRATION             1413701 non-null  float64 
 20   DAYS_ID_PUBLISH               1413701 non-null  int64   
 21   FLAG_MOBIL                    1413701 non-null  int64   
 22   FLAG_EMP_PHONE                1413701 non-null  int64   
 23   FLAG_WORK_PHONE               1413701 non-null  int64   
 24   FLAG_CONT_MOBILE              1413701 non-null  int64   
 25   FLAG_PHONE                    1413701 non-null  int64   
 26   FLAG_EMAIL                    1413701 non-null  int64   
 27   OCCUPATION_TYPE               1413701 non-null  object  
 28   CNT_FAM_MEMBERS               1413701 non-null  float64 
 29   REGION_RATING_CLIENT          1413701 non-null  int64   
 30   REGION_RATING_CLIENT_W_CITY   1413701 non-null  int64   
 31   WEEKDAY_APPR_PROCESS_START_x  1413701 non-null  object  
 32   HOUR_APPR_PROCESS_START_x     1413701 non-null  int64   
 33   REG_REGION_NOT_LIVE_REGION    1413701 non-null  int64   
 34   REG_REGION_NOT_WORK_REGION    1413701 non-null  int64   
 35   LIVE_REGION_NOT_WORK_REGION   1413701 non-null  int64   
 36   REG_CITY_NOT_LIVE_CITY        1413701 non-null  int64   
 37   REG_CITY_NOT_WORK_CITY        1413701 non-null  int64   
 38   LIVE_CITY_NOT_WORK_CITY       1413701 non-null  int64   
 39   ORGANIZATION_TYPE             1413701 non-null  object  
 40   EXT_SOURCE_2                  1411880 non-null  float64 
 41   EXT_SOURCE_3                  1171859 non-null  float64 
 42   OBS_30_CNT_SOCIAL_CIRCLE      1410555 non-null  float64 
 43   DEF_30_CNT_SOCIAL_CIRCLE      1410555 non-null  float64 
 44   OBS_60_CNT_SOCIAL_CIRCLE      1410555 non-null  float64 
 45   DEF_60_CNT_SOCIAL_CIRCLE      1410555 non-null  float64 
 46   DAYS_LAST_PHONE_CHANGE        1413701 non-null  float64 
 47   FLAG_DOCUMENT_2               1413701 non-null  int64   
 48   FLAG_DOCUMENT_3               1413701 non-null  int64   
 49   FLAG_DOCUMENT_4               1413701 non-null  int64   
 50   FLAG_DOCUMENT_5               1413701 non-null  int64   
 51   FLAG_DOCUMENT_6               1413701 non-null  int64   
 52   FLAG_DOCUMENT_7               1413701 non-null  int64   
 53   FLAG_DOCUMENT_8               1413701 non-null  int64   
 54   FLAG_DOCUMENT_9               1413701 non-null  int64   
 55   FLAG_DOCUMENT_10              1413701 non-null  int64   
 56   FLAG_DOCUMENT_11              1413701 non-null  int64   
 57   FLAG_DOCUMENT_12              1413701 non-null  int64   
 58   FLAG_DOCUMENT_13              1413701 non-null  int64   
 59   FLAG_DOCUMENT_14              1413701 non-null  int64   
 60   FLAG_DOCUMENT_15              1413701 non-null  int64   
 61   FLAG_DOCUMENT_16              1413701 non-null  int64   
 62   FLAG_DOCUMENT_17              1413701 non-null  int64   
 63   FLAG_DOCUMENT_18              1413701 non-null  int64   
 64   FLAG_DOCUMENT_19              1413701 non-null  int64   
 65   FLAG_DOCUMENT_20              1413701 non-null  int64   
 66   FLAG_DOCUMENT_21              1413701 non-null  int64   
 67   AMT_REQ_CREDIT_BUREAU_HOUR    1250074 non-null  float64 
 68   AMT_REQ_CREDIT_BUREAU_DAY     1250074 non-null  float64 
 69   AMT_REQ_CREDIT_BUREAU_WEEK    1250074 non-null  float64 
 70   AMT_REQ_CREDIT_BUREAU_MON     1250074 non-null  float64 
 71   AMT_REQ_CREDIT_BUREAU_QRT     1250074 non-null  float64 
 72   AMT_REQ_CREDIT_BUREAU_YEAR    1250074 non-null  float64 
 73   YEARS_BIRTH                   1413701 non-null  int64   
 74   YEARS_EMPLOYED                1413701 non-null  int64   
 75   YEARS_REGISTRATION            1413701 non-null  int64   
 76   YEARS_ID_PUBLISH              1413701 non-null  int64   
 77   YEARS_LAST_PHONE_CHANGE       1413701 non-null  float64 
 78   AMT_GOODS_PRICE_CATEGORY      1412493 non-null  category
 79   YEARS_BIRTH_CATEGORY          1413701 non-null  object  
 80   YEARS_REGISTRATION_CATEGORY   1413701 non-null  object  
 81   SK_ID_PREV                    1413701 non-null  int64   
 82   NAME_CONTRACT_TYPE_y          1413701 non-null  object  
 83   AMT_ANNUITY_y                 1106483 non-null  float64 
 84   AMT_APPLICATION               1413701 non-null  float64 
 85   AMT_CREDIT_y                  1413700 non-null  float64 
 86   AMT_GOODS_PRICE_y             1094176 non-null  float64 
 87   WEEKDAY_APPR_PROCESS_START_y  1413701 non-null  object  
 88   HOUR_APPR_PROCESS_START_y     1413701 non-null  int64   
 89   FLAG_LAST_APPL_PER_CONTRACT   1413701 non-null  object  
 90   NFLAG_LAST_APPL_IN_DAY        1413701 non-null  int64   
 91   NAME_CASH_LOAN_PURPOSE        1413701 non-null  object  
 92   NAME_CONTRACT_STATUS          1413701 non-null  object  
 93   DAYS_DECISION                 1413701 non-null  int64   
 94   NAME_PAYMENT_TYPE             1413701 non-null  object  
 95   CODE_REJECT_REASON            1413701 non-null  object  
 96   NAME_CLIENT_TYPE              1413701 non-null  object  
 97   NAME_GOODS_CATEGORY           1413701 non-null  object  
 98   NAME_PORTFOLIO                1413701 non-null  object  
 99   NAME_PRODUCT_TYPE             1413701 non-null  object  
 100  CHANNEL_TYPE                  1413701 non-null  object  
 101  SELLERPLACE_AREA              1413701 non-null  int64   
 102  NAME_SELLER_INDUSTRY          1413701 non-null  object  
 103  CNT_PAYMENT                   1106488 non-null  float64 
 104  NAME_YIELD_GROUP              1413701 non-null  object  
 105  PRODUCT_COMBINATION           1413388 non-null  object  
dtypes: category(1), float64(26), int64(50), object(29)
memory usage: 1.1+ GB

We do not see any columns with Nullable values

In [403]:
df_merge.shape
Out[403]:
(1413701, 106)

There are ~1.41 million rows and 106 columns

In [404]:
df_merge[df_merge['SK_ID_CURR'] == 265681][['AMT_CREDIT_x','AMT_CREDIT_y']].head()
Out[404]:
AMT_CREDIT_x AMT_CREDIT_y
654385 278460.00 109552.50
654386 278460.00 22234.50
654387 278460.00 225000.00
654388 278460.00 118989.00
654389 278460.00 337824.00
In [405]:
df[df['SK_ID_CURR'] == 265681][['AMT_CREDIT']].head()
Out[405]:
AMT_CREDIT
142895 278460.00

Get statistical summary for numerical variables¶

In [406]:
df_merge.describe()
Out[406]:
SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT_x AMT_ANNUITY_x AMT_GOODS_PRICE_x REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START_x REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_2 EXT_SOURCE_3 OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR YEARS_BIRTH YEARS_EMPLOYED YEARS_REGISTRATION YEARS_ID_PUBLISH YEARS_LAST_PHONE_CHANGE SK_ID_PREV AMT_ANNUITY_y AMT_APPLICATION AMT_CREDIT_y AMT_GOODS_PRICE_y HOUR_APPR_PROCESS_START_y NFLAG_LAST_APPL_IN_DAY DAYS_DECISION SELLERPLACE_AREA CNT_PAYMENT
count 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1413608.00 1412493.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1411880.00 1171859.00 1410555.00 1410555.00 1410555.00 1410555.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1250074.00 1250074.00 1250074.00 1250074.00 1250074.00 1250074.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1413701.00 1106483.00 1413701.00 1413700.00 1094176.00 1413701.00 1413701.00 1413701.00 1413701.00 1106488.00
mean 278481.29 0.09 0.40 173316.04 587553.67 27017.02 527718.59 0.02 16321.05 72663.47 5003.23 3034.56 1.00 0.81 0.18 1.00 0.28 0.07 2.15 2.06 2.04 11.98 0.01 0.04 0.04 0.07 0.22 0.17 0.51 0.49 1.54 0.15 1.53 0.11 1084.70 0.00 0.74 0.00 0.01 0.09 0.00 0.07 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.04 0.30 0.36 2.91 44.72 199.14 13.70 8.31 2.97 1922744.33 15837.18 175243.59 196354.09 226451.19 12.48 1.00 880.37 314.99 16.05
std 102811.79 0.28 0.72 198573.37 384917.33 13951.16 353246.49 0.01 4344.56 143337.36 3551.05 1507.38 0.00 0.40 0.39 0.03 0.45 0.26 0.90 0.50 0.49 3.23 0.11 0.21 0.19 0.26 0.42 0.38 0.19 0.20 2.53 0.47 2.51 0.38 799.94 0.01 0.44 0.01 0.12 0.29 0.01 0.26 0.06 0.01 0.04 0.00 0.04 0.04 0.03 0.08 0.01 0.06 0.02 0.02 0.01 0.08 0.11 0.21 0.98 0.93 2.20 11.91 392.84 9.74 4.15 2.21 532715.34 14724.91 293622.19 319481.25 315937.60 3.33 0.06 783.54 7695.08 14.57
min 100002.00 0.00 0.00 25650.00 45000.00 1615.50 40500.00 0.00 7489.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 21.00 0.00 0.00 0.00 0.00 1000001.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 -1.00 0.00
25% 189364.00 0.00 0.00 112500.00 270000.00 16821.00 238500.00 0.01 12739.00 1042.00 2001.00 1783.00 1.00 1.00 0.00 1.00 0.00 0.00 2.00 2.00 2.00 10.00 0.00 0.00 0.00 0.00 0.00 0.00 0.39 0.34 0.00 0.00 0.00 0.00 396.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 35.00 3.00 5.00 5.00 1.00 1461346.00 6257.86 19750.50 24880.50 49828.50 10.00 1.00 271.00 -1.00 6.00
50% 278992.00 0.00 0.00 157500.00 508495.50 24925.50 450000.00 0.02 16044.00 2401.00 4508.00 3330.00 1.00 1.00 0.00 1.00 0.00 0.00 2.00 2.00 2.00 12.00 0.00 0.00 0.00 0.00 0.00 0.00 0.56 0.51 0.00 0.00 0.00 0.00 1011.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00 44.00 7.00 12.00 9.00 3.00 1922698.00 11223.76 70870.50 80595.00 110245.50 12.00 1.00 582.00 4.00 12.00
75% 367556.00 0.00 1.00 207000.00 807984.00 34542.00 679500.00 0.03 19980.00 6313.00 7510.00 4319.00 1.00 1.00 0.00 1.00 1.00 0.00 3.00 2.00 2.00 14.00 0.00 0.00 0.00 0.00 0.00 0.00 0.66 0.65 2.00 0.00 2.00 0.00 1683.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.00 55.00 17.00 21.00 12.00 5.00 2384012.00 20420.78 180000.00 215640.00 229500.00 15.00 1.00 1313.00 85.00 24.00
max 456255.00 1.00 19.00 117000000.00 4050000.00 225000.00 4050000.00 0.07 25201.00 365243.00 24672.00 7197.00 1.00 1.00 1.00 1.00 1.00 1.00 20.00 3.00 3.00 23.00 1.00 1.00 1.00 1.00 1.00 1.00 0.85 0.90 348.00 34.00 344.00 24.00 4292.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 4.00 9.00 8.00 27.00 261.00 25.00 69.00 1001.00 68.00 20.00 12.00 2845381.00 418058.15 5850000.00 4509688.50 5850000.00 23.00 1.00 2922.00 4000000.00 84.00

Analyzing categorical variables¶

In [407]:
df_merge.select_dtypes(include = "object").columns
Out[407]:
Index(['NAME_CONTRACT_TYPE_x', 'CODE_GENDER', 'FLAG_OWN_CAR',
       'FLAG_OWN_REALTY', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE',
       'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE',
       'OCCUPATION_TYPE', 'WEEKDAY_APPR_PROCESS_START_x', 'ORGANIZATION_TYPE',
       'YEARS_BIRTH_CATEGORY', 'YEARS_REGISTRATION_CATEGORY',
       'NAME_CONTRACT_TYPE_y', 'WEEKDAY_APPR_PROCESS_START_y',
       'FLAG_LAST_APPL_PER_CONTRACT', 'NAME_CASH_LOAN_PURPOSE',
       'NAME_CONTRACT_STATUS', 'NAME_PAYMENT_TYPE', 'CODE_REJECT_REASON',
       'NAME_CLIENT_TYPE', 'NAME_GOODS_CATEGORY', 'NAME_PORTFOLIO',
       'NAME_PRODUCT_TYPE', 'CHANNEL_TYPE', 'NAME_SELLER_INDUSTRY',
       'NAME_YIELD_GROUP', 'PRODUCT_COMBINATION'],
      dtype='object')
In [408]:
# Checking number of categorical variables
len(df_merge.select_dtypes(include = "object").columns)
Out[408]:
29

There are 29 categorical variables

Analyzing numerical variables¶

In [409]:
df_merge.select_dtypes(include=["int64","float64"]).columns
Out[409]:
Index(['SK_ID_CURR', 'TARGET', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT_x', 'AMT_ANNUITY_x', 'AMT_GOODS_PRICE_x',
       'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED',
       'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'FLAG_MOBIL', 'FLAG_EMP_PHONE',
       'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL',
       'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT',
       'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START_x',
       'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
       'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
       'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EXT_SOURCE_2',
       'EXT_SOURCE_3', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
       'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE',
       'DAYS_LAST_PHONE_CHANGE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3',
       'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6',
       'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9',
       'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12',
       'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15',
       'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18',
       'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21',
       'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY',
       'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON',
       'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR',
       'YEARS_BIRTH', 'YEARS_EMPLOYED', 'YEARS_REGISTRATION',
       'YEARS_ID_PUBLISH', 'YEARS_LAST_PHONE_CHANGE', 'SK_ID_PREV',
       'AMT_ANNUITY_y', 'AMT_APPLICATION', 'AMT_CREDIT_y', 'AMT_GOODS_PRICE_y',
       'HOUR_APPR_PROCESS_START_y', 'NFLAG_LAST_APPL_IN_DAY', 'DAYS_DECISION',
       'SELLERPLACE_AREA', 'CNT_PAYMENT'],
      dtype='object')
In [410]:
# Checking number of categorical variables
len(df_merge.select_dtypes(include=["int64","float64"]).columns)
Out[410]:
76

There are 76 numerical variables

Univariate analysis of categorical variables¶

Analysis of NAME_CONTRACT_STATUS¶

In [411]:
df_merge['NAME_CONTRACT_STATUS'].value_counts().sort_values(ascending = False)
Out[411]:
Approved        886099
Canceled        259441
Refused         245390
Unused offer     22771
Name: NAME_CONTRACT_STATUS, dtype: int64
In [412]:
fn_uni_countplot_merge('NAME_CONTRACT_STATUS')
No description has been provided for this image
In [413]:
fn_uni_piechart_merge('NAME_CONTRACT_STATUS')
No description has been provided for this image

Observations

  • Approved loan status is the highest among all loan applications
  • Canceled loan status is the second highest among all loan applications

Analysis of NAME_CLIENT_TYPE¶

In [414]:
df_merge['NAME_CLIENT_TYPE'].value_counts().sort_values(ascending = False)
Out[414]:
Repeater     1039225
New           259540
Refreshed     114936
Name: NAME_CLIENT_TYPE, dtype: int64
In [415]:
fn_uni_countplot_merge('NAME_CLIENT_TYPE')
No description has been provided for this image
In [416]:
fn_uni_piechart_merge('NAME_CLIENT_TYPE')
No description has been provided for this image

Observations

  • Repeater client type is the highest among all loan applications
  • New client type is the second highest among all loan applications

Analysis of CHANNEL_TYPE¶

In [417]:
df_merge['CHANNEL_TYPE'].value_counts().sort_values(ascending = False)
Out[417]:
Credit and cash offices       600342
Country-wide                  423062
Stone                         183508
Regional / Local               92472
Contact center                 58891
AP+ (Cash loan)                49624
Channel of corporate sales      5396
Car dealer                       406
Name: CHANNEL_TYPE, dtype: int64
In [418]:
fn_uni_countplot_merge('CHANNEL_TYPE')
No description has been provided for this image
In [419]:
fn_uni_piechart_merge('CHANNEL_TYPE')
No description has been provided for this image

Observations

  • Country-wide Channel type is the highest among all loan applications
  • Credit and cash offices is the second highest Channel Type among all loan applications

Analysis of NAME_YIELD_GROUP¶

In [420]:
df_merge['NAME_YIELD_GROUP'].value_counts().sort_values(ascending = False)
Out[420]:
XNA           429198
middle        324971
high          306061
low_normal    274884
low_action     78587
Name: NAME_YIELD_GROUP, dtype: int64
In [421]:
fn_uni_countplot_merge('NAME_YIELD_GROUP')
No description has been provided for this image
In [422]:
fn_uni_piechart_merge('NAME_YIELD_GROUP')
No description has been provided for this image

Observations

  • XNA interest rate is the highest among all loan applications
  • middle and high interest rates are the second and third highest among all loan applications

Analysis of NAME_GOODS_CATEGORY¶

In [423]:
df_merge['NAME_GOODS_CATEGORY'].value_counts().sort_values(ascending = False).head()
Out[423]:
XNA                     797209
Mobile                  193743
Consumer Electronics    104785
Computers                89923
Audio/Video              85502
Name: NAME_GOODS_CATEGORY, dtype: int64
In [424]:
fn_uni_countplot_merge('NAME_GOODS_CATEGORY')
No description has been provided for this image
In [425]:
fn_uni_piechart_merge('NAME_GOODS_CATEGORY')
No description has been provided for this image

Observations

  • XNA goods category is the highest among all loan applications
  • mobile goods categoryis the second highest among all loan applications

Univariate analysis of numerical variables¶

Analysis of AMT_APPLICATION¶

Outlier identification of AMT_APPLICATION¶

Calculating IQR (Inter Quartile range)

In [426]:
Q1 = df_merge['AMT_APPLICATION'].quantile(0.25)
Q3 = df_merge['AMT_APPLICATION'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
160249.5

Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers

In [427]:
Min_value = (Q1 - 1.5 * IQR)
Max_value = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value))
print("Max value after which outlier exist: {}".format(Max_value))
Min value before which outlier exist: -220623.75
Max value after which outlier exist: 420374.25

Removing outliers and plotting distplot

In [428]:
plt.figure(figsize = [20,8])
sns.distplot(df_merge[df_merge['AMT_APPLICATION'] <= Max_value].AMT_APPLICATION, hist=True)
plt.ticklabel_format(style='plain', axis='x')
plt.xticks(rotation = 45)
plt.show()
No description has been provided for this image

Observations

  • Most of the loan amount applied by the clients initially seems to be very small as can be seen from the huge spike at the beginning of the distribution

Analysis of AMT_ANNUITY_y¶

Outlier identification of AMT_ANNUITY_y¶

Calculating IQR (Inter Quartile range)

In [429]:
Q1 = df_merge['AMT_ANNUITY_y'].quantile(0.25)
Q3 = df_merge['AMT_ANNUITY_y'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
14162.917500000001

Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers

In [430]:
Min_value = (Q1 - 1.5 * IQR)
Max_value = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value))
print("Max value after which outlier exist: {}".format(Max_value))
Min value before which outlier exist: -14986.518750000001
Max value after which outlier exist: 41665.15125

Removing outliers and plotting distplot

In [431]:
plt.figure(figsize = [20,8])
sns.distplot(df_merge[df_merge['AMT_ANNUITY_y'] <= Max_value].AMT_ANNUITY_y, hist=True)
plt.ticklabel_format(style='plain', axis='x')
plt.xticks(rotation = 45)
plt.show()
No description has been provided for this image

Observations

  • Most of the previous loan's annuity from the clients is less than 10,000 as the distribution is high here
  • As previous loan's annuity increases, the no. of clients decreases

Analysis of AMT_CREDIT_y¶

Outlier identification of AMT_CREDIT_y¶

Calculating IQR (Inter Quartile range)

In [432]:
Q1 = df_merge['AMT_CREDIT_y'].quantile(0.25)
Q3 = df_merge['AMT_CREDIT_y'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
190759.5

Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers

In [433]:
Min_value = (Q1 - 1.5 * IQR)
Max_value = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value))
print("Max value after which outlier exist: {}".format(Max_value))
Min value before which outlier exist: -261258.75
Max value after which outlier exist: 501779.25

Removing outliers and plotting distplot

In [434]:
plt.figure(figsize = [20,8])
sns.distplot(df_merge[df_merge['AMT_CREDIT_y'] <= Max_value].AMT_CREDIT_y, hist=True)
plt.ticklabel_format(style='plain', axis='x')
plt.xticks(rotation = 45)
plt.show()
No description has been provided for this image

Observations

  • This distribution very closely resembles that of AMT_APPLICATION. This means that most people received the loan amount that they applied for

Analysis of AMT_GOODS_PRICE_y¶

Outlier identification of AMT_GOODS_PRICE_y¶

Calculating IQR (Inter Quartile range)

In [435]:
Q1 = df_merge['AMT_GOODS_PRICE_y'].quantile(0.25)
Q3 = df_merge['AMT_GOODS_PRICE_y'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
179671.5

Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers

In [436]:
Min_value = (Q1 - 1.5 * IQR)
Max_value = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value))
print("Max value after which outlier exist: {}".format(Max_value))
Min value before which outlier exist: -219678.75
Max value after which outlier exist: 499007.25

Removing outliers and plotting distplot

In [437]:
plt.figure(figsize = [20,8])
sns.distplot(df_merge[df_merge['AMT_GOODS_PRICE_y'] <= Max_value].AMT_GOODS_PRICE_y, hist=True)
plt.ticklabel_format(style='plain', axis='x')
plt.xticks(rotation = 45)
plt.show()
No description has been provided for this image

Observations

  • Most of the goods price asked by clients in previous application is less than 100K

Correlation analysis of numerical variables¶

Plotting correlation matrix¶

In [438]:
corr_df = df_merge[['AMT_ANNUITY_x', 'AMT_APPLICATION','AMT_CREDIT_x', 'AMT_GOODS_PRICE_x',
                    'AMT_ANNUITY_y', 'AMT_CREDIT_y', 'AMT_GOODS_PRICE_y', 'CNT_PAYMENT']].corr()
In [439]:
corr_df.head()
Out[439]:
AMT_ANNUITY_x AMT_APPLICATION AMT_CREDIT_x AMT_GOODS_PRICE_x AMT_ANNUITY_y AMT_CREDIT_y AMT_GOODS_PRICE_y CNT_PAYMENT
AMT_ANNUITY_x 1.00 0.11 0.76 0.76 0.20 0.11 0.13 -0.01
AMT_APPLICATION 0.11 1.00 0.12 0.12 0.81 0.98 1.00 0.68
AMT_CREDIT_x 0.76 0.12 1.00 0.99 0.15 0.12 0.14 0.04
AMT_GOODS_PRICE_x 0.76 0.12 0.99 1.00 0.16 0.12 0.14 0.04
AMT_ANNUITY_y 0.20 0.81 0.15 0.16 1.00 0.82 0.82 0.40

Creating a HeatMap

In [440]:
plt.figure(figsize = (20,8))
sns.heatmap(data = corr_df, annot = True, cmap = "RdYlGn", cbar = True, fmt='.2f')
plt.show()
No description has been provided for this image

Observations

  • AMT_APPLICATION has a high correlation with AMT_ANNUITY_y,AMT_CREDIT_y,AMT_GOODS_PRICE_y and decent correlation with CNT_PAYMENT
  • AMT_GOODS_PRICE_y has a high correlation with AMT_ANNUITY_y,AMT_CREDIT_y,AMT_APPLICATION and decent correlation with CNT_PAYMENT
  • AMT_CREDIT_y has a high correlation with AMT_GOODS_PRICE_y and decent correlation with CNT_PAYMENT
  • AMT_ANNUITY_x has a high correlation with AMT_GOODS_PRICE_y,AMT_CREDIT_y
  • AMT_ANNUITY_x has a high correlation with AMT_GOODS_PRICE_x,AMT_CREDIT_x
  • AMT_CREDIT_x has a high correlation with AMT_GOODS_PRICE_x

Bivariate/Multivariate analysis¶

Continuous V/S Continuous variables¶

Analysis of AMT_GOODS_PRICE_y V/S AMT_CREDIT_y V/S NAME_CONTRACT_STATUS¶

Outlier identification of AMT_GOODS_PRICE_y

In [441]:
max_value1_AMT_GOODS_PRICE_y = outlier_range(df_merge,'AMT_GOODS_PRICE_y')
max_value1_AMT_GOODS_PRICE_y
Out[441]:
499007.25

Outlier identification of AMT_CREDIT_y

In [442]:
max_value1_AMT_CREDIT_y = outlier_range(df_merge,'AMT_CREDIT_y')
max_value1_AMT_CREDIT_y
Out[442]:
501779.25

Plotting scatterplot

In [443]:
plt.figure(figsize = [20,8])

sns.scatterplot(x = df_merge[df_merge['AMT_GOODS_PRICE_y'] < max_value1_AMT_GOODS_PRICE_y].AMT_GOODS_PRICE_y, 
                y = df_merge[df_merge['AMT_CREDIT_y'] < max_value1_AMT_CREDIT_y].AMT_CREDIT_y,
                data = df_merge,hue = 'NAME_CONTRACT_STATUS')
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')

plt.tight_layout(pad = 4)
plt.show()
No description has been provided for this image

Observations

  • At lower levels of previous application's Goods price < 200K and Credit > 300k, have a chance of getting refused. However, this is a weak correlation as we have less data points to support this

Analysis of AMT_ANNUITY_y V/S AMT_CREDIT_y V/S NAME_CONTRACT_STATUS¶

Outlier identification of AMT_ANNUITY_y

In [444]:
max_value1_AMT_ANNUITY_y = outlier_range(df_merge,'AMT_ANNUITY_y')
max_value1_AMT_ANNUITY_y
Out[444]:
41665.15125

Outlier identification of AMT_CREDIT_y

In [445]:
max_value1_AMT_CREDIT_y = outlier_range(df_merge,'AMT_CREDIT_y')
max_value1_AMT_CREDIT_y
Out[445]:
501779.25

Plotting scatterplot

In [446]:
plt.figure(figsize = [20,8])

sns.scatterplot(x = df_merge[df_merge['AMT_ANNUITY_y'] < max_value1_AMT_ANNUITY_y].AMT_ANNUITY_y, 
                y = df_merge[df_merge['AMT_CREDIT_y'] < max_value1_AMT_CREDIT_y].AMT_CREDIT_y,
                data = df_merge,hue = 'NAME_CONTRACT_STATUS')
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')

plt.tight_layout(pad = 4)
plt.show()
No description has been provided for this image

Observations

  • There are lots of refusal observations with Annuity amount < 10000 and Credit amount > ~250K. This might be because higher credit amount should also require higher Annuity from Client to pay it

Analysis of AMT_APPLICATION V/S AMT_GOODS_PRICE_y V/S NAME_CONTRACT_STATUS¶

Outlier identification of AMT_APPLICATION

In [447]:
max_value1_AMT_APPLICATION = outlier_range(df_merge,'AMT_APPLICATION')
max_value1_AMT_APPLICATION
Out[447]:
420374.25

Outlier identification of AMT_GOODS_PRICE_y

In [448]:
max_value1_AMT_GOODS_PRICE_y = outlier_range(df_merge,'AMT_GOODS_PRICE_y')
max_value1_AMT_GOODS_PRICE_y
Out[448]:
499007.25

Plotting scatterplot

In [449]:
plt.figure(figsize = [20,8])

sns.scatterplot(x = df_merge[df_merge['AMT_APPLICATION'] < max_value1_AMT_APPLICATION].AMT_APPLICATION, 
                y = df_merge[df_merge['AMT_GOODS_PRICE_y'] < max_value1_AMT_GOODS_PRICE_y].AMT_GOODS_PRICE_y,
                data = df_merge,hue = 'NAME_CONTRACT_STATUS')
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')

plt.tight_layout(pad = 4)
plt.show()
No description has been provided for this image

Observations

  • Application amount has strong positive correlation with Goods price

Analysis of AMT_APPLICATION V/S AMT_CREDIT_y V/S NAME_CONTRACT_STATUS¶

Outlier identification of AMT_APPLICATION

In [450]:
max_value1_AMT_APPLICATION = outlier_range(df_merge,'AMT_APPLICATION')
max_value1_AMT_APPLICATION
Out[450]:
420374.25

Outlier identification of AMT_CREDIT_y

In [451]:
max_value1_AMT_CREDIT_y= outlier_range(df_merge,'AMT_CREDIT_y')
max_value1_AMT_CREDIT_y
Out[451]:
501779.25

Plotting scatterplot

In [452]:
plt.figure(figsize = [20,8])

sns.scatterplot(x = df_merge[df_merge['AMT_APPLICATION'] < max_value1_AMT_APPLICATION].AMT_APPLICATION, 
                y = df_merge[df_merge['AMT_CREDIT_y'] < max_value1_AMT_CREDIT_y].AMT_CREDIT_y,
                data = df_merge,hue = 'NAME_CONTRACT_STATUS')
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')

plt.tight_layout(pad = 4)
plt.show()
No description has been provided for this image

Observations

  • Application amount has strong positive correlation with Credit amount

Continuous V/S Categorical variables¶

Analysis of NAME_CONTRACT_STATUS V/S AMT_CREDIT_y V/S CODE_GENDER¶

Outlier identification of AMT_CREDIT_y

In [453]:
max_value_AMT_CREDIT_y = outlier_range(df_merge,'AMT_CREDIT_y')
max_value_AMT_CREDIT_y
Out[453]:
501779.25

Client with Payment difficulties

In [454]:
df_merge.groupby(by = ['NAME_CONTRACT_STATUS','CODE_GENDER']).AMT_CREDIT_y.describe().head()
Out[454]:
count mean std min 25% 50% 75% max
NAME_CONTRACT_STATUS CODE_GENDER
Approved F 598648.00 207549.31 279373.33 0.00 48078.00 104935.50 229500.00 4509688.50
M 287450.00 189288.80 267951.42 0.00 45430.62 94140.00 197820.00 4050000.00
Canceled F 179044.00 24215.18 160577.58 0.00 0.00 0.00 0.00 3847104.00
M 80397.00 26053.06 174648.76 0.00 0.00 0.00 0.00 3524220.00
Refused F 163327.00 376531.00 458660.51 0.00 66159.00 202500.00 513531.00 4085550.00
In [455]:
fn_bi_boxplot_merge('NAME_CONTRACT_STATUS','AMT_CREDIT_y',max_value_AMT_CREDIT_y,'CODE_GENDER')
No description has been provided for this image

Observations

  • Clients who are Refused and Female apply for higher median credit amount than Male

Analysis of NAME_CONTRACT_STATUS V/S AMT_ANNUITY_y V/S CODE_GENDER¶

Outlier identification of AMT_ANNUITY_y

In [456]:
max_value_AMT_ANNUITY_y= outlier_range(df_merge,'AMT_ANNUITY_y')
max_value_AMT_ANNUITY_y
Out[456]:
41665.15125

Client with Payment difficulties

In [457]:
df_merge.groupby(by = ['NAME_CONTRACT_STATUS','CODE_GENDER']).AMT_ANNUITY_y.describe().head()
Out[457]:
count mean std min 25% 50% 75% max
NAME_CONTRACT_STATUS CODE_GENDER
Approved F 598644.00 14660.85 13624.36 0.00 5927.98 10424.61 18314.93 393868.66
M 287448.00 14513.24 13996.13 0.00 5790.79 10010.41 17900.91 229274.95
Canceled F 6073.00 28917.04 17584.22 1324.31 16236.18 25409.74 38133.00 210950.86
M 2687.00 33066.86 19936.90 1780.88 19257.17 29550.60 43799.47 218713.55
Refused F 141148.00 20318.15 16580.62 0.00 8765.83 15750.00 27512.04 418058.15
In [458]:
fn_bi_boxplot_merge('NAME_CONTRACT_STATUS','AMT_ANNUITY_y',max_value_AMT_ANNUITY_y,'CODE_GENDER')
No description has been provided for this image

Observations

  • Clients who got Cancelled and Male paid higher median Annuity than Female
  • Clients who got Refused and Female paid higher median Annuity than Male

Analysis of NAME_CLIENT_TYPE V/S AMT_GOODS_PRICE_y V/S NAME_CONTRACT_STATUS¶

Outlier identification of AMT_GOODS_PRICE_y

In [459]:
max_value_AMT_GOODS_PRICE_y= outlier_range(df_merge,'AMT_GOODS_PRICE_y')
max_value_AMT_GOODS_PRICE_y
Out[459]:
499007.25

Client with Payment difficulties

In [460]:
df_merge.groupby(by = ['NAME_CLIENT_TYPE','NAME_CONTRACT_STATUS']).AMT_GOODS_PRICE_y.describe().head()
Out[460]:
count mean std min 25% 50% 75% max
NAME_CLIENT_TYPE NAME_CONTRACT_STATUS
New Approved 242297.00 103479.31 137367.96 0.00 38106.00 67500.00 121500.00 5850000.00
Canceled 186.00 481775.18 576602.84 22455.00 135000.00 225000.00 675000.00 3150000.00
Refused 12219.00 169230.61 280373.28 0.00 45000.00 97452.00 180000.00 4050000.00
Unused offer 1867.00 79936.32 61318.33 0.00 39325.50 66758.85 104231.25 688059.00
Refreshed Approved 79838.00 188335.88 245136.33 0.00 54859.50 106469.89 219150.00 3150000.00
In [461]:
fn_bi_boxplot_merge('NAME_CONTRACT_STATUS','AMT_GOODS_PRICE_y',max_value_AMT_GOODS_PRICE_y,'NAME_CLIENT_TYPE')
No description has been provided for this image

Observations

  • Clients who are New and Canceledhave less median goods price compared to Repeater and Refreshed
  • Clients who are Approved and New have less median goods price compared to Repeater and Refreshed

Analysis of NAME_CONTRACT_STATUS V/S AMT_CREDIT_y V/S NAME_PORTFOLIO¶

Outlier identification of AMT_CREDIT_y

In [462]:
max_value_AMT_CREDIT_y= outlier_range(df_merge,'AMT_CREDIT_y')
max_value_AMT_CREDIT_y
Out[462]:
501779.25

Client with Payment difficulties

In [463]:
df_merge.groupby(by = ['NAME_PORTFOLIO','NAME_CONTRACT_STATUS']).AMT_CREDIT_y.describe().head()
Out[463]:
count mean std min 25% 50% 75% max
NAME_PORTFOLIO NAME_CONTRACT_STATUS
Cards Approved 82407.00 227605.36 215931.33 0.00 67500.00 157500.00 292500.00 1350000.00
Canceled 417.00 228733.81 186123.23 45000.00 112500.00 180000.00 270000.00 900000.00
Refused 39159.00 239762.70 207093.85 0.00 90000.00 180000.00 270000.00 2250000.00
Unused offer 1.00 202500.00 NaN 202500.00 202500.00 202500.00 202500.00 202500.00
Cars Approved 241.00 1122764.27 644748.22 202500.00 735520.50 980100.00 1293556.50 4509688.50
In [464]:
fn_bi_boxplot_merge('NAME_CONTRACT_STATUS','AMT_CREDIT_y',max_value_AMT_CREDIT_y,'NAME_PORTFOLIO')
No description has been provided for this image

Observations

  • Clients who have Unused offer receive more median credit in POS portfolio
  • Clients who are Refused receive more median credit in Cash portfolio
  • Clients who are Approved receive more median credit in Cars portfolio

Categorical V/S Categorical variables¶

Analysis of YEARS_BIRTH_CATEGORY V/S NAME_CONTRACT_STATUS¶

In [465]:
fn_bi_countplot_merge('YEARS_BIRTH_CATEGORY','NAME_CONTRACT_STATUS')
No description has been provided for this image

Observations

  • Clients who are in the age range 30-40 get most approval followed by clients in 40-50 age range
  • Clients who are in the age range 60-70 receive least refusals followed by 20-30 age range

Analysis of NAME_FAMILY_STATUS V/S NAME_CONTRACT_STATUS¶

In [466]:
fn_bi_countplot_merge('NAME_FAMILY_STATUS','NAME_CONTRACT_STATUS')
No description has been provided for this image

Observations

  • Clients who are Married receive the most approvals

Analysis of NAME_EDUCATION_TYPE V/S NAME_CONTRACT_STATUS¶

In [467]:
fn_bi_countplot_merge('NAME_EDUCATION_TYPE','NAME_CONTRACT_STATUS')
No description has been provided for this image

Observations

  • Clients who have Secondary/secondary special receive the most approvals

Analysis of NAME_CLIENT_TYPE V/S NAME_CONTRACT_STATUS¶

In [468]:
fn_bi_countplot_merge('NAME_CLIENT_TYPE','NAME_CONTRACT_STATUS')
No description has been provided for this image

Observations

  • Clients who are Repeaters receive the most approvals followed by New

Conclusion: Client categories to be targeted for providing loan

  • Clients who are employed for more than 19 years
  • Clients in the age range 30-40 and 40-50
  • Clients who are Married
  • Male clients with Academic degree
  • Students and Businessman
  • Repeater clients